Feb 16, 2026

arXiv:2602.15183

Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel, Rodrigo Toro Icarte

AI Summary

The paper investigates how visual training improves the out-of-distribution (OOD) performance of Vision Language Models (VLMs) on text-only information retrieval tasks. Using a controlled synthetic retrieval task, the authors show that VLMs generalize better OOD than their underlying LLMs because visual training shifts the model's binding strategy. Specifically, visual training disrupts the positional shortcuts learned during text-only training, forcing the model to adopt a more robust symbolic binding mechanism.

Key Contribution

Surprisingly, training on visual data can rewire LLMs to perform better on purely text-based reasoning tasks by forcing them to abandon brittle positional shortcuts.

Abstract

Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
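To make the contrast between a positional shortcut and symbolic binding concrete, the toy sketch below may help. It is not the paper's task or model: the key-value format, the choice to split in-distribution and OOD data by the position of the queried key, and every function name here are assumptions made purely for illustration. The two "solvers" are hand-written stand-ins for the strategies a trained transformer might adopt.

```python
import random

def make_example(rng, n_pairs, query_pos):
    """Build one toy example: a context of (key, value) pairs, a queried key, and its value."""
    keys = rng.sample(range(100), n_pairs)    # distinct keys
    values = rng.sample(range(100), n_pairs)  # distinct values, so a wrong slot is always wrong
    context = list(zip(keys, values))
    return context, keys[query_pos], values[query_pos]

def positional_shortcut(context, query, memorized_pos=0):
    """Hypothetical shortcut: ignore the query and read the slot seen during training."""
    return context[memorized_pos][1]

def symbolic_binding(context, query):
    """Hypothetical robust strategy: return the value bound to the queried key."""
    return dict(context)[query]

def accuracy(solver, examples):
    """Fraction of examples where the solver returns the target value."""
    return sum(solver(c, q) == t for c, q, t in examples) / len(examples)

rng = random.Random(0)
# Assumed in-distribution split: the queried key always sits in slot 0.
in_dist = [make_example(rng, 8, 0) for _ in range(200)]
# Assumed OOD split: the queried key sits in any slot except 0.
ood = [make_example(rng, 8, rng.randrange(1, 8)) for _ in range(200)]

print(accuracy(positional_shortcut, in_dist))  # 1.0
print(accuracy(symbolic_binding, in_dist))     # 1.0
print(accuracy(positional_shortcut, ood))      # 0.0
print(accuracy(symbolic_binding, ood))         # 1.0
```

Both strategies are indistinguishable on the in-distribution split, which is why perfect in-distribution accuracy alone says nothing about which one a model has learned. Only the shifted split separates them, mirroring (in a much simplified form) the gap the abstract reports between in-distribution and OOD performance.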

Architecture Design (Transformers, SSMs, MoE)
Multimodal Models
Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
