The paper addresses hallucination in Large Vision-Language Models (LVLMs) by proposing Dual-Modal Collaborative Attention Reinforcement (DuCAR). DuCAR uses intra-visual CLS-driven sampling and cross-modal dynamic sampling to extract important visual tokens, then adaptively enhances the attention weights of these tokens during multimodal fusion. Experiments on the POPE and CHAIR benchmarks show that DuCAR outperforms existing hallucination-mitigation methods.
By jointly reinforcing informative visual tokens and suppressing attention to question-irrelevant ones, DuCAR reduces hallucinations in LVLMs and outperforms prior approaches that address only one modality.
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual-language understanding for downstream multimodal tasks. However, these models often generate descriptions containing objects or details not present in the input image, a phenomenon commonly referred to as "hallucination". Existing methods mitigate hallucination from one side only: intra-modal-only reinforcement (e.g. visual attention enhancement) ignores prompt-based guidance, while inter-modal-only correlation correction may introduce low-information visual tokens that mislead reasoning. To tackle this challenge, we propose Dual-Modal Collaborative Attention Reinforcement (DuCAR). Specifically, DuCAR is equipped with intra-visual CLS-driven sampling and cross-modal dynamic sampling, which extract important visual tokens guided by joint intra- and inter-modal information. During the multimodal fusion stage, DuCAR adaptively enhances the attention weights of these visual tokens. Together, the sampling and enhancement strategies in DuCAR reinforce informative visual tokens and suppress attention dispersion towards question-irrelevant visual information. We conduct extensive experiments on the POPE and CHAIR hallucination benchmarks, demonstrating that our method outperforms existing state-of-the-art mitigation baselines and effectively reduces hallucinations in text generated by LVLMs. The code is available at https://github.com/xjy2020/DuCAR.
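The abstract gives no formulas, so the following is only a minimal NumPy sketch of the general idea (select visual tokens using both an intra-visual and a cross-modal signal, then boost their attention weights), not the authors' implementation. The function names, the top-k rule, the above-average threshold, and the boost factor `alpha` are all hypothetical choices made for illustration; the released code may differ in every one of them.

```python
import numpy as np

def select_by_cls(cls_attn, k):
    """Assumed intra-visual rule: keep the k visual tokens that
    receive the most attention from the vision encoder's [CLS] token."""
    return np.argsort(cls_attn)[-k:]

def select_by_text(text_to_vis, ratio=1.0):
    """Assumed cross-modal rule: keep visual tokens whose mean attention
    from the text tokens exceeds ratio * the average over all visual tokens.
    The number of tokens kept therefore varies with the prompt."""
    score = text_to_vis.mean(axis=0)
    return np.flatnonzero(score > ratio * score.mean())

def reinforce(attn_row, selected, alpha=0.5):
    """Scale the attention weights of the selected visual tokens by
    (1 + alpha), then renormalize so the row sums to 1 again.
    Renormalizing implicitly lowers the weight of unselected tokens."""
    boosted = attn_row.copy()
    boosted[selected] *= 1.0 + alpha
    return boosted / boosted.sum()

# Toy inputs (random, for illustration only).
rng = np.random.default_rng(0)
num_visual, num_text = 16, 4
cls_attn = rng.random(num_visual)                  # [CLS] -> visual tokens
text_to_vis = rng.random((num_text, num_visual))   # text -> visual tokens
text_to_vis /= text_to_vis.sum(axis=1, keepdims=True)

# Union of the two selections: tokens important under either signal.
selected = np.union1d(select_by_cls(cls_attn, k=4),
                      select_by_text(text_to_vis))

# One row of fusion-stage attention over visual tokens (uniform here).
attn_row = np.full(num_visual, 1.0 / num_visual)
new_row = reinforce(attn_row, selected)
```

In this sketch the two selections are merged with a union, so a token survives if either signal marks it as important; an intersection or a weighted score would be equally plausible readings of "guided by joint intra- and inter-modal information".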