Hohai, Xidian | Feb 17, 2026 | arXiv:2602.15556

Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs

Guangtao Lyu, Chenghao Xu, Jiexi Yan, Muli Yang, Xueting Li, Fen Fang

AI Summary

This paper investigates the attention sink phenomenon in Large Vision-Language Models (LVLMs) and demonstrates that Positive Attention Dynamics (PAD) can reveal semantically core visual regions even under attention sink distortions. The authors introduce Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention method that uses a PAD map to identify core visual regions, adaptively controls intervention strength with per-head Median Absolute Deviation Scaling, and maintains instruction following with System-Token Compensation. Experiments across multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, confirming the utility of internal attention dynamics for reliable multimodal reasoning.

Key Contribution

LVLMs already highlight the right image regions; amplifying their "Positive Attention Dynamics" cuts through attention-sink noise and reduces hallucinations.

Abstract

LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free methods each have drawbacks: contrastive decoding and auxiliary expert models incur several times more computational overhead and may introduce interference, while static internal signal enhancement is often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions even under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
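The abstract names the components of PADE but does not give their formulas, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that "positive attention dynamics" can be approximated by summing the positive layer-to-layer increases in attention on each visual token, and that "per-head Median Absolute Deviation Scaling" means scaling the boost by the MAD of that head's logits. The function names (`positive_dynamics`, `mad`, `enhance_head`) and the `strength` parameter are hypothetical. System-Token Compensation is not sketched because the abstract gives no detail on it.

```python
from statistics import median

def positive_dynamics(attn_by_layer):
    """ASSUMPTION: approximate a PAD map by accumulating only the positive
    layer-to-layer increases in attention for each visual token.
    attn_by_layer: list of per-layer lists of weights over the same tokens."""
    n_tokens = len(attn_by_layer[0])
    pad = [0.0] * n_tokens
    for prev, curr in zip(attn_by_layer, attn_by_layer[1:]):
        for i in range(n_tokens):
            pad[i] += max(curr[i] - prev[i], 0.0)
    return pad

def mad(values):
    """Median absolute deviation: median(|x - median(x)|)."""
    values = list(values)
    m = median(values)
    return median([abs(v - m) for v in values])

def enhance_head(logits, pad, strength=1.0):
    """ASSUMPTION: boost one head's pre-softmax visual-token logits by the
    peak-normalized PAD map, scaled by that head's own MAD."""
    scale = mad(logits)
    peak = max(pad) or 1.0  # avoid dividing by zero when no token gained attention
    return [x + strength * scale * (p / peak) for x, p in zip(logits, pad)]

# Token 0 behaves like a sink (constantly high), token 1 fades,
# token 2 steadily gains attention across layers.
layers = [[0.6, 0.3, 0.1], [0.6, 0.2, 0.2], [0.6, 0.1, 0.3]]
pad_map = positive_dynamics(layers)
boosted = enhance_head([2.0, 1.0, 0.0], pad_map)
```

In this toy input the sink token keeps the largest raw attention at every layer yet receives a PAD score of zero, while the token whose attention grows receives the full boost. A static magnitude-based signal would favor the sink instead, which is the failure mode the abstract attributes to attention sinks.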

Computer Vision | Interpretability & Mechanistic Interp | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
