The paper introduces Caption-guided Visual Attention Steering (CAST), a training-free method to mitigate object hallucination in LVLMs by leveraging the enhanced visual attention observed during caption queries. CAST identifies attention heads sensitive to caption queries and steers their outputs to strengthen fine-grained visual perception. Experiments across five LVLMs and five benchmarks show CAST reduces object hallucination by 6.03% on average, achieving SOTA performance with minimal overhead.
Steer LVLMs' attention with caption guidance and watch object hallucinations drop by 6%, with no training required.
Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce content that deviates from the visual information, leading to object hallucination. To tackle this, most recent approaches rely either on expensive manual annotation and training, or on decoding strategies that significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly enhanced when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-guided Visual Attention Steering (CAST), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern corresponding to caption queries to enhance LVLMs' visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly sensitive to caption queries and estimate optimized steering directions for their outputs. This steering strengthens LVLMs' fine-grained visual perception capabilities, thereby effectively mitigating object hallucination. CAST reduces object hallucination by an average of 6.03% across five widely used LVLMs and five benchmarks, including both discriminative and generative tasks, demonstrating state-of-the-art performance while adding little inference cost and preserving other foundational capabilities.
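The probe-then-steer idea in the abstract can be illustrated with a minimal NumPy sketch on synthetic data. This is not the authors' implementation: the abstract does not specify the probe or how the steering directions are optimized, so the sketch substitutes a simple mean-difference probe and a mean-difference direction scaled by the projection's standard deviation. All names here (`probe_heads`, `steering_vectors`, `steer`, `alpha`) and the synthetic activations are hypothetical.

```python
import numpy as np

def probe_heads(acts, labels, train_frac=0.5, seed=0):
    """Score each head by held-out accuracy of a mean-difference probe.

    acts:   (n_samples, n_heads, head_dim) per-head outputs (synthetic here)
    labels: (n_samples,) 1 = caption query, 0 = non-caption query
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    cut = int(len(idx) * train_frac)
    tr, te = idx[:cut], idx[cut:]
    mu1 = acts[tr][labels[tr] == 1].mean(axis=0)  # (n_heads, head_dim)
    mu0 = acts[tr][labels[tr] == 0].mean(axis=0)
    direction = mu1 - mu0                         # caption minus non-caption
    midpoint = (mu1 + mu0) / 2
    # Project held-out activations onto each head's direction and threshold at 0.
    scores = np.einsum("nhd,hd->nh", acts[te] - midpoint, direction)
    preds = (scores > 0).astype(int)
    acc = (preds == labels[te][:, None]).mean(axis=0)
    return acc, direction

def steering_vectors(acts, direction, heads, alpha=1.0):
    """Assumed scaling: unit direction times the std of projections onto it."""
    unit = direction[heads] / np.linalg.norm(direction[heads], axis=-1, keepdims=True)
    proj = np.einsum("nkd,kd->nk", acts[:, heads], unit)
    return alpha * proj.std(axis=0)[:, None] * unit  # (k, head_dim)

def steer(head_out, heads, vectors):
    """Add steering vectors to the selected heads; other heads are untouched."""
    out = head_out.copy()
    out[..., heads, :] += vectors
    return out

# Synthetic stand-in for real LVLM activations: heads 2 and 5 respond to caption queries.
rng = np.random.default_rng(0)
n, H, d = 400, 8, 16
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, H, d))
acts[:, [2, 5], 0] += 3.0 * labels[:, None]

acc, direction = probe_heads(acts, labels)
top = np.argsort(acc)[::-1][:2]            # the two most caption-sensitive heads
vecs = steering_vectors(acts, direction, top, alpha=1.0)
steered = steer(acts[0], top, vecs)        # steer one sample's head outputs
print("selected heads:", sorted(top.tolist()))
```

In a real model the activations would come from hooks on the attention layers, and `steer` would run inside the forward pass at inference time. The number of heads selected and the strength `alpha` are tuning choices in this sketch, not values taken from the paper.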