Feb 17, 2026 · arXiv:2602.15580

How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning

AI Summary

This paper introduces PID Flow, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation, to decompose the predictive information in multimodal Transformers layer-by-layer into redundant, vision-unique, language-unique, and synergistic components. Applying this framework to LLaVA models on GQA, the authors find a consistent "modal transduction" pattern where visual-unique information peaks early and decays, language-unique information surges late, and cross-modal synergy remains low. Through targeted attention knockouts, they establish a causal link between this transduction pathway and the model's reasoning process, demonstrating how disrupting the pathway affects information flow and task performance.

Key Contribution

Multimodal LLMs primarily rely on language-unique information for final predictions, with visual information decaying across layers and cross-modal synergy remaining surprisingly low (under 2%).

Abstract

When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation, and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce PID Flow, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent modal transduction pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82% of the final prediction, and cross-modal synergy remains below 2%. This trajectory is highly stable across model variants (layer-wise correlations >0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image→Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost, effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.
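To make the four-way decomposition concrete, the sketch below computes a Gaussian PID directly from a joint covariance matrix. It is an illustration under stated assumptions, not the paper's estimator: the abstract does not say which redundancy measure PID Flow uses, so this sketch assumes the minimum-mutual-information (MMI) redundancy, and the helper names `gaussian_mi` and `mmi_pid` are hypothetical. It also skips the dimensionality-reduction and normalizing-flow steps and presumes the variables are already jointly Gaussian.

```python
import numpy as np


def gaussian_mi(cov, idx_a, idx_b):
    """Mutual information I(A;B) in nats for jointly Gaussian variables.

    Uses I(A;B) = 0.5 * (log|S_A| + log|S_B| - log|S_AB|), where each S is
    the covariance block of the listed variable indices.
    """
    idx_a, idx_b = list(idx_a), list(idx_b)
    idx_ab = idx_a + idx_b
    _, logdet_a = np.linalg.slogdet(cov[np.ix_(idx_a, idx_a)])
    _, logdet_b = np.linalg.slogdet(cov[np.ix_(idx_b, idx_b)])
    _, logdet_ab = np.linalg.slogdet(cov[np.ix_(idx_ab, idx_ab)])
    return 0.5 * (logdet_a + logdet_b - logdet_ab)


def mmi_pid(cov, vis, lang, tgt):
    """Four PID components under the ASSUMED MMI redundancy measure."""
    i_v = gaussian_mi(cov, vis, tgt)    # I(Y; V)
    i_l = gaussian_mi(cov, lang, tgt)   # I(Y; L)
    i_vl = gaussian_mi(cov, list(vis) + list(lang), tgt)  # I(Y; V, L)
    red = min(i_v, i_l)                 # assumption: MMI redundancy
    return {
        "redundant": red,
        "vision_unique": i_v - red,
        "language_unique": i_l - red,
        "synergy": i_vl - i_v - i_l + red,
    }


if __name__ == "__main__":
    # Toy joint covariance for (V, L, Y) with Y = V + L + unit-variance noise
    # and V, L independent standard normals. Purely illustrative numbers.
    cov = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0],
                    [1.0, 1.0, 3.0]])
    pid = mmi_pid(cov, vis=[0], lang=[1], tgt=[2])
    total = sum(pid.values())
    for name, value in pid.items():
        print(f"{name}: {value:.4f} nats ({100 * value / total:.1f}%)")
```

Whatever redundancy measure is chosen, the four components sum to the total predictive information I(Y; V, L), which is what allows each one to be reported as a share of the final prediction, as in the percentages quoted above.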

Interpretability & Mechanistic Interp · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
