The paper investigates text dominance in large audio-language models (LALMs), where models under-utilize audio evidence. Using mechanistic interpretability, the authors identify a small set of "audio-specialist" attention heads that exhibit a "listening" signal, indicating engagement with audio input. They then construct an audio–silence steering direction and apply an inference-time activation intervention to amplify the model's audio effect, improving accuracy on the MMAU benchmark by up to 8.0 percentage points on Qwen-based LALMs.
Forget retraining: steering a handful of attention heads in audio-language models can boost audio understanding by up to 8 points on MMAU, revealing a surprisingly simple way to overcome text dominance.
Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. Large audio-language models (LALMs) are one example: decisive audio evidence can be under-utilized even when it carries important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio–silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the model's audio effect. To demonstrate the utility of this intervention, we show on MMAU that it improves accuracy by up to +8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
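The general recipe behind this kind of activation steering can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the paper's implementation: the array shapes, the mean-difference construction of the direction, the scaling factor `alpha`, and the function names are all hypothetical stand-ins for the audio–silence direction and the final-representation intervention described above.

```python
import numpy as np

def steering_direction(audio_acts, silence_acts):
    """Mean-difference direction between activations recorded on audio
    inputs and on matched silence inputs (each of shape (n, d_model)),
    normalized to unit length."""
    direction = audio_acts.mean(axis=0) - silence_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def apply_intervention(hidden, direction, alpha=4.0):
    """Inference-time intervention: add the scaled steering direction to
    a hidden representation. No model parameters are updated."""
    return hidden + alpha * direction

# Toy demo with random stand-in activations (d_model = 8).
rng = np.random.default_rng(0)
audio = rng.normal(1.0, 0.1, size=(32, 8))    # activations on audio inputs
silence = rng.normal(0.0, 0.1, size=(32, 8))  # activations on silence inputs

d = steering_direction(audio, silence)
h = rng.normal(size=8)                        # one final-layer representation
h_steered = apply_intervention(h, d, alpha=4.0)
```

In practice the direction would be computed per selected attention head from cached activations, and `alpha` tuned on a held-out set; the key property is that the model's weights are untouched and the edit is applied only at inference time.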