Apr 2, 2026 · arXiv:2604.01989

Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Bo Gong, Boyang Gong, Yu Zheng, Yujin Zheng, Fanye Kong, Jiwen Lu

AI Summary

The authors identify "visual inertia" in MLLMs: visual attention settles during early decoding steps and then remains largely static, which hinders compositional understanding and leads to cognitive hallucinations. Through token-wise attention analysis, they find that attention stays fixed on semantically critical regions instead of shifting as relational inference requires. To address this, they propose Inertia-aware Visual Excitation (IVE), a training-free method that selects visual tokens emerging relative to historical attention trends and penalizes inertial attention. They report that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.

Key Contribution

MLLMs exhibit "visual inertia": visual attention settles early in decoding and then stays largely static. A training-free intervention, IVE, breaks this inertia and mitigates cognitive hallucinations.

Abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We therefore propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
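The selection and penalty steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the exponential moving average used as the "historical attention trend", the thresholds, the scaling factors, and the function names `update_trend` and `excite` are all assumptions made here for illustration. The sketch operates on a single normalized attention vector over visual tokens at one decoding step.

```python
from typing import List, Tuple


def update_trend(trend: List[float], attn: List[float],
                 momentum: float = 0.9) -> List[float]:
    """Exponential moving average of past attention.

    Assumption: an EMA stands in for the 'historical attention trend';
    the abstract does not say how the trend is computed.
    """
    return [momentum * h + (1.0 - momentum) * a for h, a in zip(trend, attn)]


def excite(attn: List[float], trend: List[float], rise: float = 0.05,
           alpha: float = 0.5, beta: float = 0.5
           ) -> Tuple[List[float], List[bool], List[bool]]:
    """Boost emerging visual tokens and damp inertial ones.

    All parameters are hypothetical:
      rise  - minimum gain over the trend for a token to count as emerging
      alpha - relative boost applied to emerging tokens
      beta  - relative penalty applied to inertial tokens
    """
    uniform = 1.0 / len(attn)

    # Emerging: current attention clearly exceeds the historical trend.
    emerging = [a - h > rise for a, h in zip(attn, trend)]

    # Inertial (assumed definition): historically above a uniform share
    # and not currently emerging, i.e. attention that has settled in place.
    inertial = [h > uniform and not e for h, e in zip(trend, emerging)]

    adjusted = []
    for a, e, i in zip(attn, emerging, inertial):
        if e:
            a *= 1.0 + alpha   # excite newly relevant tokens
        elif i:
            a *= 1.0 - beta    # penalize persistent over-concentration
        adjusted.append(a)

    # Renormalize so the result is again a distribution over visual tokens.
    total = sum(adjusted)
    return [a / total for a in adjusted], emerging, inertial


if __name__ == "__main__":
    trend = [0.7, 0.1, 0.1, 0.1]      # attention has been stuck on token 0
    attn = [0.6, 0.25, 0.1, 0.05]     # token 1 is newly gaining attention
    new_attn, emerging, inertial = excite(attn, trend)
    print(new_attn, emerging, inertial)
    trend = update_trend(trend, attn)  # carry the trend to the next step
```

In this toy setting, token 0 is treated as inertial and token 1 as emerging, so mass moves from the former to the latter after renormalization. Where in the model such an adjustment would be applied (which layers, which heads, before or after the softmax) is not stated in the abstract and is left open here.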

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
