College of Computer and Data Science, HUST, IA CAS, NUDT, QLU, School of Artificial Intelligence, SEU, Texas A&M, TJU, Tongji, ZJU

Oct 27, 2025

Collaboration Wins More: Dual-Modal Collaborative Attention Reinforcement for Mitigating Large Vision Language Models Hallucination

Jiye Xie, Yifei Gao, Liangliang You, Xian-ming Xu, Haoran Xu, Zhiqiang Kou, Kexue Fu, Youyang Qu, Wenjie Yang, Jianwei Guo, Weiliang Meng, Longxiang Gao, Haoran Yang, Changwei Wang, Yu Zhang

AI Summary

The paper addresses the problem of hallucination in Large Vision-Language Models (LVLMs) by proposing a Dual-Modal Collaborative Attention Reinforcement (DuCAR) method. DuCAR uses intra-visual CLS-driven sampling and cross-modal dynamic sampling to extract important visual tokens, and then adaptively enhances the attention weights of these tokens during multimodal fusion. Experiments on POPE and CHAIR benchmarks demonstrate that DuCAR outperforms existing methods in mitigating hallucinations.

Key Contribution

By jointly reinforcing informative visual tokens and suppressing irrelevant ones, DuCAR significantly reduces hallucinations in LVLMs, outperforming prior single-modality focused approaches.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual-language understanding for downstream multimodal tasks. However, these models often generate descriptions containing objects or details not present in the input image, a phenomenon commonly referred to as "hallucination". Existing methods mitigate hallucination from one side only: intra-modal-only reinforcement (e.g. visual attention enhancement) ignores prompt-based guidance, while inter-modal-only correlation correction may introduce low-information visual tokens that mislead reasoning. To tackle this challenge, we propose Dual-Modal Collaborative Attention Reinforcement (DuCAR). Specifically, DuCAR is equipped with intra-visual CLS-driven sampling and cross-modal dynamic sampling, which extract important visual tokens guided by joint intra- and inter-modal information. During the multimodal fusion stage, DuCAR adaptively enhances the attention weights of these visual tokens. Together, the sampling and enhancement strategies in DuCAR reinforce informative visual tokens and suppress attention dispersion towards question-irrelevant visual information. We conduct extensive experiments on the POPE and CHAIR hallucination benchmarks, demonstrating that our method outperforms existing state-of-the-art mitigation baselines and effectively reduces hallucinations in text generated by LVLMs. The code is available at https://github.com/xjy2020/DuCAR.
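
The abstract names two sampling steps and an attention-enhancement step but gives no formulas, so the following is only a rough sketch of the general idea and not the authors' implementation. It assumes, purely for illustration, that each sampler is a top-k selection over an attention vector, that the two selections are merged by set union, and that enhancement scales the selected weights by a fixed factor before renormalizing. The function names, `k_intra`, `k_cross`, and `alpha` are all hypothetical.

```python
def top_k_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_tokens(cls_attn, text_attn, k_intra, k_cross):
    """Merge two token selections (assumed here to be top-k plus union).

    cls_attn:  attention from the vision encoder's CLS token to each visual token
    text_attn: attention from the text prompt to each visual token
    """
    intra = set(top_k_indices(cls_attn, k_intra))   # intra-visual, CLS-driven
    cross = set(top_k_indices(text_attn, k_cross))  # cross-modal, prompt-driven
    return sorted(intra | cross)


def reinforce_attention(attn, selected, alpha):
    """Scale selected weights by (1 + alpha), then renormalize to sum to 1.

    Renormalizing lowers the relative weight of every unselected token,
    which is one simple way to reduce attention dispersion.
    """
    chosen = set(selected)
    boosted = [w * (1.0 + alpha) if i in chosen else w for i, w in enumerate(attn)]
    total = sum(boosted)
    return [w / total for w in boosted]


# Toy example with five visual tokens (all numbers are made up).
cls_attn = [0.05, 0.40, 0.10, 0.30, 0.15]
text_attn = [0.10, 0.10, 0.50, 0.25, 0.05]
selected = select_tokens(cls_attn, text_attn, k_intra=2, k_cross=2)

fusion_attn = [0.2, 0.2, 0.2, 0.2, 0.2]
reinforced = reinforce_attention(fusion_attn, selected, alpha=1.0)
print(selected)
print(reinforced)
```

In this toy run, tokens 1, 2 and 3 are selected, and their share of the fused attention rises while tokens 0 and 4 lose weight. The abstract describes the enhancement as adaptive, which a fixed `alpha` does not capture; the linked repository is the place to check how the selection sizes and the enhancement strength are actually determined.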

Eval Frameworks & Benchmarks, Multimodal Models, Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 2

Influential citations: 0

References: 7

Year: 2025

Venue: ACM Multimedia
