4: Normalizing attribution score $\tilde{s}^{l}=\frac{s^{l}}{\sum_{i=1}^{L}s^{i}}$
5: Scaling feature steering intensity $\lambda_{l}=\lambda\cdot m_{l}+\lambda\cdot\tilde{s}^{l}$
6: Steering feature $\tilde{\mathbf{h}}_{l}=f(\mathbf{h}_{l},\lambda_{l})$
7: return $\tilde{\mathbf{h}}_{l}$

4 Experiments and Analysis

In this section, we empirically investigate the effectiveness of LTS-FS in mitigating hallucinations while preserving model generalization. Specifically, we use 100 sentence-level hallucination samples and 100 token-level hallucination samples to synthesize the bi-granularity dataset for layer-wise attribution. The sentence-level hallucination samples are selected and processed from the CHAIR benchmark [33], while the token-level hallucination samples are from POPE [22] and Antidote [41].

Table 1: CHAIR results of various LVLMs on MSCOCO. Bold indicates the best performance. Lower CS and CI indicate less hallucination. Recall and output length (Len.) serve as controls, indicating that reductions in CS/CI do not stem from suppressing objects or truncating responses. ∗ denotes the feature steering methods. Method LLaVA-v1.5-

State Key Lab. of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
3 School of Computer Science and Technology, University of Chinese Academy of Sciences
{dangtiantian23, qmhuang}@ucas.ac.cn {bichao,shenshufan22z,liujinzhe23b,wangshuhui}@ict.ac.cn
Corresponding author.

Abstract

Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers.
This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucinations while preserving strong general performance.

1 Introduction

By harnessing the advanced text generation capabilities of Large Language Models, Large Vision-Language Models (LVLMs) have achieved impressive performance across various multimodal tasks [1, 28, 40, 44]. Despite their strong performance, LVLMs face a significant challenge known as hallucination, wherein the model generates fluent and semantically coherent responses that include factually incorrect statements about the input visual content [21, 12, 44]. Such hallucinations hinder the reliability of LVLMs, posing serious risks in real-world applications [16, 34].

Figure 1: (a) t-SNE visualizations of features in LVLM layers. (b) Performance on the CHAIR and MMMU benchmarks. Current methods (e.g., Nullu [42]) mitigate hallucinations by uniformly steering features across layers, which (a) alters feature distributions and (b) leads to degraded performance on general tasks like MMMU.
In contrast, we propose a layerwise steering framework, LTS-FS, which mitigates hallucinations more effectively (e.g., on CHAIR) while minimally perturbing the feature distributions, thus preserving more of the model's generalization ability.

To mitigate hallucinations in LVLMs, early studies finetune the whole model on specially designed datasets, which is costly and degrades generalization ability [25, 38, 9]. In contrast, decoding-based methods introduce strategies such as contrastive decoding [19, 2] and self-correction [45, 5] to mitigate hallucinations in a training-free manner, thereby preserving the original capabilities of pre-trained models. Nevertheless, these methods significantly increase the number of decoding steps required for each input query, leading to high inference costs for real-world deployment. Recently, feature steering methods [42, 31] have shown promise in overcoming the above limitations. These methods adjust the features of intermediate layers by steering them from their original positions in the feature space toward directions that are less prone to generating hallucinated outputs. By modifying only the features, without introducing additional decoding steps, feature steering methods maintain inference costs comparable to those of the original model. However, current methods steer features based on heuristically designed rules [31] (e.g., adjusting all layers). These rules overlook the inherent differences across layers in pre-trained models, causing the steering process to disturb layers that are less relevant to hallucinations. This disruption alters the feature distributions (Fig. 1(a)) and ultimately impairs the model's generalization ability (Fig. 1(b)), similar to the tuning-based methods. Therefore, an improved method that mitigates hallucinations through feature steering while preserving the original capabilities of LVLMs is urgently required.
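As a concrete illustration of the feature-steering idea described above, the following is a minimal sketch under simplifying assumptions: the steering direction `v` and the subtractive form `h - λ·v` are illustrative only; concrete methods such as Nullu derive their steering directions differently (e.g., from a hallucination-related subspace) rather than via a single hand-picked vector.

```python
import numpy as np

def steer_features(h, v, lam):
    """Shift a layer's hidden states away from a hallucination-prone
    direction: h_tilde = h - lam * v_hat (illustrative form only).
    h: (seq_len, d) hidden states; v: (d,) steering direction."""
    v_hat = v / np.linalg.norm(v)   # unit steering direction
    return h - lam * v_hat          # move features along -v_hat

# Toy example: steer 3 token states of dimension 4 with intensity 0.5.
h = np.ones((3, 4))
v = np.array([1.0, 0.0, 0.0, 0.0])
h_steered = steer_features(h, v, lam=0.5)
```

Because only the hidden states are modified, a transform of this shape adds a single vector operation per steered layer and leaves the number of decoding steps unchanged, which is why feature steering keeps inference costs close to the original model.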
In this paper, we propose Locate-Then-Sparsify for Feature Steering (LTS-FS), a plug-and-play framework that effectively mitigates hallucinations while preserving the inherent capabilities of LVLMs. First, we construct a dataset including hallucination samples at two granularities. With this dataset, we locate the hallucination-relevant layers through intervention-based attribution. Guided by the attribution scores, we propose a layerwise strategy that selectively steers features in hallucination-relevant layers rather than uniformly adjusting all layers. As shown in Fig. 1, compared with Nullu, a classical feature-steering-based method, our strategy barely disrupts the original feature distribution. Meanwhile, the evaluation results on the MMMU benchmark demonstrate that our LTS-FS not only produces fewer hallucinatory expressions but also achieves better generalization performance.

Specifically, for dataset construction, we first distinguish hallucinations in LVLMs according to token-level and sentence-level granularities. Then, we construct hallucination samples at both granularity levels to build a dataset. Supported by this dataset, we locate hallucination-relevant layers through an attribution method based on causal interventions. This method sequentially masks the attention output of each layer to assess its contribution to the logits of hallucination outputs. Based on this contribution, we define attribution scores and assign them to each layer, reflecting its relevance to the hallucination phenomenon. After obtaining layer-wise attribution scores, we propose a sparsity-aware layer selection and steering strategy that converts the attribution scores into steering intensities (i.e., applying weaker steering to layers with low scores and stronger steering to those with high scores).
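The attribution-and-conversion pipeline described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: `forward_fn`, `toy_forward`, and all numbers are hypothetical stand-ins for an LVLM forward pass; the sketch only mirrors the stated procedure of masking each layer's attention output, scoring the resulting drop in the hallucinated token's logit, and converting normalized scores into per-layer intensities via $\lambda_{l}=\lambda\cdot m_{l}+\lambda\cdot\tilde{s}^{l}$.

```python
import numpy as np

def attribute_layers(forward_fn, num_layers, halluc_token):
    """Intervention-based attribution sketch (hypothetical API):
    forward_fn(l) returns output logits with layer l's attention
    output masked; forward_fn(None) runs the unmodified model.
    A layer's score is the drop in the hallucinated token's logit."""
    base = forward_fn(None)[halluc_token]
    return np.array([base - forward_fn(l)[halluc_token]
                     for l in range(num_layers)])

def steering_intensities(scores, lam, mask):
    """Convert attribution scores s^l into per-layer intensities
    lambda_l = lam * m_l + lam * s_tilde^l, where s_tilde is the
    score normalized over all layers and m_l is a selection mask."""
    s_tilde = scores / scores.sum()
    return lam * np.asarray(mask, dtype=float) + lam * s_tilde

# Toy stand-in for an LVLM: masking layer 1 causes the largest logit drop.
def toy_forward(masked_layer):
    logits = np.array([2.0, 5.0])
    if masked_layer == 1:
        logits[1] -= 3.0
    elif masked_layer is not None:
        logits[1] -= 0.5
    return logits

scores = attribute_layers(toy_forward, num_layers=3, halluc_token=1)
mask = (scores == scores.max()).astype(float)   # select the top-scoring layer
lams = steering_intensities(scores, lam=1.0, mask=mask)
```

In this toy run the high-impact layer receives a large intensity while low-score layers receive only small, score-proportional adjustments, which is the sparsity behavior the layerwise strategy aims for.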
By modifying only hallucination-relevant layers, we mitigate hallucinations while minimizing interference with the model's feature distribution, thereby more effectively preserving its original capabilities. We conduct extensive experiments to demonstrate that LTS-FS can further improve the hallucination mitigation capability of current SOTA feature steering methods (e.g., a 2% accuracy gain on POPE-popular with Qwen-VL-2.5-