The paper introduces AdaIAT, a method that reduces hallucinations in Large Vision-Language Models (LVLMs) by adaptively increasing attention to generated text tokens. It builds on the observation that real object tokens attend more strongly to the generated text than hallucinated object tokens do. A layer-wise threshold controls when the intervention is applied, and the amplification magnitude is tailored to each attention head. On LLaVA-1.5, AdaIAT reduces the hallucination rates $C_S$ and $C_I$ by 35.8% and 37.1%, respectively, while preserving linguistic performance.
Stop LVLMs from making things up: AdaIAT slashes hallucination rates by over 35% by cleverly boosting attention to the *right* text tokens.
Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). One intuitive and effective way to mitigate it is to directly increase the attention weights assigned to image tokens during inference. Although this reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first analyze attention patterns and find that real object tokens tend to assign higher attention to the generated text than hallucinated ones do. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increasing Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT), which employs a layer-wise threshold to control when the intervention is applied and a fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results on several LVLMs show that AdaIAT effectively alleviates hallucination (reducing the hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
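To make the idea of amplifying attention to generated text more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact formulas, so the function names, the multiplicative `(1 + alpha)` boost, the renormalization step, and the rule "intervene only when the attention mass on generated text falls below the threshold" are all assumptions made for illustration. It operates on the attention probabilities of a single head for a single query token.

```python
def amplify_generated_text_attention(weights, generated_idx, alpha):
    """Boost attention on generated-text positions, then renormalize.

    weights:       attention probabilities of one head for the current
                   query token (non-negative, summing to 1).
    generated_idx: positions of previously generated text tokens.
    alpha:         hypothetical amplification strength for this head.
    """
    generated = set(generated_idx)
    # Assumed form of the boost: scale selected weights by (1 + alpha).
    boosted = [
        w * (1.0 + alpha) if i in generated else w
        for i, w in enumerate(weights)
    ]
    total = sum(boosted)
    # Renormalize so the result is again a probability distribution.
    return [w / total for w in boosted]


def adaptive_amplify(weights, generated_idx, alpha, threshold):
    """Apply the boost only when a (hypothetical) layer-wise threshold fires.

    Assumption: the intervention is skipped when the head already places
    at least `threshold` of its attention mass on the generated text.
    """
    generated_mass = sum(weights[i] for i in generated_idx)
    if generated_mass >= threshold:
        return list(weights)  # leave the head untouched
    return amplify_generated_text_attention(weights, generated_idx, alpha)


if __name__ == "__main__":
    # Toy example: positions 0-1 stand in for image/prompt tokens,
    # positions 2-3 for already generated text tokens.
    attention = [0.5, 0.3, 0.1, 0.1]
    print(adaptive_amplify(attention, [2, 3], alpha=1.0, threshold=0.3))
```

In a real LVLM, `alpha` would be chosen per attention head and `threshold` per layer, which is the "adaptive" part described above; how those values are obtained is not stated in the abstract and is left open here.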