Corresponding authors: Yixuan Yuan
Ningbo NoPolyU
Mar 17, 2026
arXiv:2603.16372

InViC: Intent-aware Visual Cues for Medical Visual Question Answering

Zhisong Wang, Ziyang Chen, Zanting Ye, Hongze Zhu, Yefeng Zheng, Yong Xia

AI Summary

The paper introduces Intent-aware Visual Cues (InViC), a plug-in framework for medical visual question answering (Med-VQA) designed to improve the reliance of multimodal LLMs on visual evidence. InViC uses a Cue Tokens Extraction (CTE) module to distill question-conditioned visual tokens and injects them into the LLM decoder, coupled with a two-stage fine-tuning strategy that initially bottlenecks visual information through the cue pathway. Experiments on VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019 show that InViC consistently improves performance over zero-shot inference and standard LoRA fine-tuning by encouraging the models to attend to relevant visual cues.
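The summary does not say how the CTE module is built internally. As a rough illustration only, the sketch below assumes one plausible design: K base query vectors are shifted by a question embedding and then cross-attend over the dense visual tokens, so each cue token is a question-dependent weighted average of visual tokens. The function name `extract_cue_tokens`, the additive conditioning, and the single-head attention are all assumptions made for this example, not details taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def extract_cue_tokens(visual_tokens, question_vec, base_queries):
    """Pool N visual tokens into K cue tokens with single-head cross-attention.

    Hypothetical design (not from the paper): each base query is shifted by
    the question embedding, so the attention weights depend on the question.
    """
    dim = len(question_vec)
    scale = 1.0 / math.sqrt(dim)
    cues = []
    for base in base_queries:
        # Condition the query on the question (assumed additive conditioning).
        query = [b + q for b, q in zip(base, question_vec)]
        # Attention weights over all visual tokens.
        weights = softmax([dot(query, v) * scale for v in visual_tokens])
        # Each cue token is a convex combination of the visual tokens.
        cue = [
            sum(w * v[d] for w, v in zip(weights, visual_tokens))
            for d in range(dim)
        ]
        cues.append(cue)
    return cues


if __name__ == "__main__":
    rng = random.Random(0)
    n_visual, k_cues, dim = 16, 4, 8  # illustrative sizes only
    visual = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_visual)]
    question = [rng.uniform(-1, 1) for _ in range(dim)]
    queries = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(k_cues)]
    cues = extract_cue_tokens(visual, question, queries)
    print(len(cues), "cue tokens of dimension", len(cues[0]))
```

In a trained model the base queries and any projection matrices would be learned parameters; here they are random values so the example stays self-contained.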

Key Contribution

Stop your Med-VQA model from "hallucinating" answers: this plug-in framework forces LLMs to actually *look* at the image by bottlenecking visual information through question-conditioned cues.

Abstract

Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote reliance on intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM's direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that combining intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving the trustworthiness of Med-VQA.
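The two attention regimes described in the abstract can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions: the decoder input is assumed to be laid out as [visual tokens | cue tokens | text tokens], and in Stage I every position outside the visual span is assumed to be blocked from attending to visual positions. Neither the layout nor the helper name `build_attention_mask` comes from the paper.

```python
def build_attention_mask(n_visual, n_cue, n_text, bottleneck):
    """Return a boolean matrix where mask[q][k] is True if query position q
    may attend to key position k.

    Assumed sequence layout (hypothetical): [visual | cue | text].
    bottleneck=True  -> Stage I style: causal mask, plus cue and text
                        positions cannot attend to raw visual positions.
    bottleneck=False -> Stage II style: plain causal mask.
    """
    total = n_visual + n_cue + n_text
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            allowed = k <= q  # standard causal constraint
            if bottleneck and q >= n_visual and k < n_visual:
                # Block the direct view of raw visual tokens so that
                # visual evidence can only arrive through the cue tokens.
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Tiny illustrative sizes: 3 visual, 2 cue, 2 text tokens.
    for name, flag in (("Stage I", True), ("Stage II", False)):
        print(name)
        for row in build_attention_mask(3, 2, 2, bottleneck=flag):
            print("".join("1" if allowed else "." for allowed in row))
```

Under this reading, the cue embeddings already carry visual information because they are computed from the visual tokens before entering the decoder, so blocking the visual columns in Stage I leaves the cues as the only route for image evidence to reach the text positions.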

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
