Stanford HAI | Apr 15, 2026 | arXiv:2604.14363

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

Akshay Paruchuri, Ishan Chatterjee, H. Fuchs, Ehsan Adeli, P. Didyk

AI Summary

The authors investigate why multimodal language models (MLLMs) struggle with visual perception by probing how much the models depend on text tokens versus visual tokens. They introduce "centroid replacement," which collapses each token to its nearest K-means centroid, and find across seven models that erasing text centroid structure costs about 4x more accuracy than erasing visual centroid structure. They then exploit this asymmetry with "text centroid contrastive decoding," which decodes against a text-centroid-erased reference and recovers up to +16.9% accuracy on individual tasks. The results point to a structural imbalance in which language representations overshadow vision.
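As a rough illustration of the centroid replacement idea described above, the following is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not say which representations are clustered, how K is chosen, or how K-means is initialized, so the toy 2-D vectors, the helper names (`kmeans`, `centroid_replace`), and the random initialization are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm (assumed variant; initialization is random)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_replace(points, k, seed=0):
    """Collapse every vector to its nearest K-means centroid."""
    centroids = kmeans(points, k, seed=seed)
    return [list(centroids[nearest(p, centroids)]) for p in points]


# Hypothetical token vectors standing in for one modality's tokens.
tokens = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
erased = centroid_replace(tokens, k=2)
print(erased)  # at most 2 distinct vectors remain
```

Applying `centroid_replace` to only the text vectors or only the visual vectors, and comparing the resulting accuracy drop, would mirror the kind of per-modality probe the summary describes.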

Key Contribution

MLLMs prioritize language over vision so strongly that visual reasoning accuracy can be improved at inference time, without retraining, by contrastively decoding against a reference whose text centroid structure has been erased.

Abstract

Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4$\times$ more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
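The abstract does not give the scoring rule used for text centroid contrastive decoding, so the sketch below uses a generic contrastive decoding form as a stand-in: the next-token log-probabilities from the full input are contrasted with those from a reference pass. The function names, the weight `alpha`, the direction of the contrast, and the example logits are all assumptions, not details from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(full_logits, reference_logits, alpha=1.0):
    """Assumed rule: (1 + alpha) * log p_full - alpha * log p_reference."""
    full = log_softmax(full_logits)
    ref = log_softmax(reference_logits)
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, ref)]


def pick_token(full_logits, reference_logits, alpha=1.0):
    """Greedy choice of the token index with the highest contrastive score."""
    scores = contrastive_scores(full_logits, reference_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical logits over a 3-token vocabulary.
full_pass = [2.0, 2.1, 0.0]       # forward pass on the unmodified input
reference_pass = [0.0, 2.1, 0.0]  # forward pass with text centroids erased

print(pick_token(full_pass, reference_pass, alpha=0.0))  # plain greedy decoding
print(pick_token(full_pass, reference_pass, alpha=1.0))  # contrastive decoding
```

With `alpha=0.0` the rule reduces to ordinary greedy decoding on the full pass. Larger values of `alpha` increasingly favor tokens that the full pass prefers more strongly than the reference pass does.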

Interpretability & Mechanistic Interp | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
