Tongji | UW-Madison | Apr 20, 2026 | arXiv:2604.17941

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

AI Summary

This paper introduces HONES, a gradient-free framework for identifying and steering task-critical neurons in multi-task vision-language models (VLMs). HONES ranks feed-forward network (FFN) neurons based on their causal write-in contributions conditioned on task-relevant attention heads, addressing the limitations of existing methods that analyze neurons in isolation and neglect task-dependent information pathways. Experiments across four multimodal tasks demonstrate that HONES outperforms existing methods in identifying task-critical neurons and improves model performance after steering.

Key Contribution

HONES makes task-aware neuron steering in VLMs possible without gradients, improving both performance and interpretability on the four multimodal tasks evaluated.

Abstract

Recent work has increasingly explored neuron-level interpretation in vision-language models (VLMs) to identify neurons critical to final predictions. However, existing neuron analyses generally focus on single tasks, limiting the comparability of neuron importance across tasks. Moreover, ranking strategies tend to score neurons in isolation, overlooking how task-dependent information pathways shape the write-in effects of feed-forward network (FFN) neurons. This oversight can exacerbate neuron polysemanticity in multi-task settings, introducing noise into both the identification of task-critical neurons and interventions on them. In this study, we propose HONES (Head-Oriented Neuron Explanation & Steering), a gradient-free framework for task-aware neuron attribution and steering in multi-task VLMs. HONES ranks FFN neurons by their causal write-in contributions conditioned on task-relevant attention heads, and further modulates salient neurons via lightweight scaling. Experiments on four diverse multimodal tasks and two popular VLMs show that HONES outperforms existing methods in identifying task-critical neurons and improves model performance after steering. Our source code is released at: https://github.com/petergit1/HONES.
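The abstract describes the pipeline only at a high level, so the following is a toy sketch of the general idea and not the HONES implementation. It assumes, purely for illustration, that a neuron's write-in is its activation times its output weight vector, that the task-relevant heads are already known, and that their summed outputs define a "task direction" against which neurons are scored. All function names, the scoring rule, and the data are hypothetical; the sketch performs no causal intervention and no head selection.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def task_direction(head_outputs: List[Vector], task_heads: List[int]) -> List[float]:
    """Sum the outputs of the chosen heads (an illustrative assumption)."""
    dim = len(head_outputs[0])
    return [sum(head_outputs[h][d] for h in task_heads) for d in range(dim)]


def rank_neurons(activations: Vector, out_weights: List[Vector],
                 direction: Vector) -> List[int]:
    """Rank neurons by activation * <w_out, direction>, highest first."""
    scores = [a * dot(w, direction) for a, w in zip(activations, out_weights)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def steer(activations: Vector, selected: List[int], scale: float) -> List[float]:
    """Scale the selected neurons' activations; leave the rest unchanged."""
    chosen = set(selected)
    return [a * scale if i in chosen else a for i, a in enumerate(activations)]


# Toy data: three heads and three FFN neurons in a 2-dimensional residual stream.
head_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
task_heads = [0, 2]                      # assumed to be given
activations = [0.5, 2.0, 1.0]            # one activation per neuron
out_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one write vector per neuron

direction = task_direction(head_outputs, task_heads)
ranking = rank_neurons(activations, out_weights, direction)
steered = steer(activations, ranking[:1], scale=1.5)
```

In this toy setup the top-ranked neuron is the one whose write vector aligns best with the summed head outputs, and only that neuron's activation is rescaled. In a real VLM the same scaling step would typically be applied inside the FFN during the forward pass; how HONES selects heads and measures causal contribution is specified in the paper and its released code, not here.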

Interpretability & Mechanistic Interp | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
