This study identifies a failure mode in multimodal large language models (MLLMs) called spatial lexical bias: adding a spatial relation word to the answer options can pull the model toward selecting that newly added option. Across nine open-weight MLLMs, the authors show that models can answer a binary spatial question correctly yet consistently select an incorrect third spatial option once it is added, which suggests that a substantial part of the failure originates on the language side rather than the visual side. Mechanistic interpretability tools trace the bias to specific LLM-side channels and neurons, and a lightweight LLM-only DPO update on tiny synthetic data mitigates it, lifting four-way robust accuracy by up to 100 points on synthetic data and by 68.0, 32.6, and 20.1 points on WhatsUp, SpatialMQA-Direct, and VSR.
Adding just one spatial word can lead MLLMs to consistently choose the wrong answer, revealing a critical vulnerability in their reasoning processes.
Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR.
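To make the "binary-stable but ternary-fragile" criterion concrete, the selection of diagnostic examples might be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the `Trial` record, its field names, and the sample data are hypothetical, and the sketch checks a single binary and a single ternary prediction per example, whereas the abstract speaks of models that consistently select the added option.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    """Hypothetical record of one question asked in two formats."""
    example_id: str
    gold: str          # correct spatial relation, e.g. "left"
    added_option: str  # spatial word added as the third answer option
    binary_pred: str   # model answer when only two options are shown
    ternary_pred: str  # model answer once the third option is added


def is_binary_stable_ternary_fragile(t: Trial) -> bool:
    # Assumed reading of the criterion: right with two options,
    # then drawn to the newly added (incorrect) option with three.
    return (
        t.binary_pred == t.gold
        and t.added_option != t.gold
        and t.ternary_pred == t.added_option
    )


def select_diagnostic(trials: List[Trial]) -> List[Trial]:
    """Keep only the trials that match the diagnostic pattern."""
    return [t for t in trials if is_binary_stable_ternary_fragile(t)]


# Made-up illustrative data, not results from the paper.
trials = [
    Trial("a", gold="left", added_option="under",
          binary_pred="left", ternary_pred="under"),   # fragile
    Trial("b", gold="left", added_option="under",
          binary_pred="left", ternary_pred="left"),    # robust
    Trial("c", gold="left", added_option="under",
          binary_pred="right", ternary_pred="under"),  # already wrong in binary
]
diagnostic = select_diagnostic(trials)
```

Filtering on binary correctness first separates cases where the model could already resolve the relation from cases where it could not, so the remaining errors can be attributed to the added option and not to a general failure on the question.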