The authors investigate why multimodal language models (MLLMs) struggle with visual perception by probing how much the models depend on text tokens versus visual tokens. They introduce "centroid replacement," which collapses each token to its nearest K-means centroid, and find across seven models that erasing text centroid structure costs about 4$\times$ more accuracy than erasing visual centroid structure. They then exploit this asymmetry with "text centroid contrastive decoding," which decodes against a text-centroid-erased reference and recovers up to +16.9% accuracy on individual tasks. The results point to a structural imbalance in which language representations overshadow vision.
MLLMs prioritize language over vision so strongly that contrasting their output against a reference with erased text centroid structure improves visual reasoning accuracy at inference time, with no retraining.
Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4$\times$ more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
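The two operations named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which representations are clustered, how K is chosen, or what contrastive scoring rule is used. The sketch assumes tokens are plain embedding vectors, uses ordinary Lloyd's K-means, and scores tokens with a common contrastive-decoding form, `(1 + alpha) * log p_full - alpha * log p_erased`. The function names and the `alpha` parameter are hypothetical.

```python
import math
import random


def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means on a list of equal-length float vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_replace(tokens, k):
    """Collapse every token vector to its nearest K-means centroid.

    Assumption: applied to one modality's tokens (text or visual) at a time.
    """
    centroids = kmeans(tokens, k)
    return [list(centroids[nearest(t, centroids)]) for t in tokens]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(full_logits, erased_logits, alpha=1.0):
    """Assumed scoring rule: (1 + alpha) * log p_full - alpha * log p_erased.

    full_logits:   next-token logits from the unmodified input.
    erased_logits: next-token logits from the text-centroid-erased reference.
    """
    lp_full = log_softmax(full_logits)
    lp_erased = log_softmax(erased_logits)
    return [(1 + alpha) * f - alpha * e for f, e in zip(lp_full, lp_erased)]


# Two well-separated groups of toy "token embeddings" collapse to two centroids.
tokens = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
collapsed = centroid_replace(tokens, k=2)

# A token the erased reference favours strongly is demoted by the contrast.
scores = contrastive_scores([1.2, 1.0], [3.0, 0.0], alpha=1.0)
best_token = max(range(len(scores)), key=scores.__getitem__)
```

In a real pipeline the erased logits would come from a second forward pass of the model on the centroid-replaced text tokens; here they are hand-written numbers so the arithmetic is easy to follow.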