Mar 31, 2026 · arXiv:2603.29676

A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

AI Summary

This paper introduces a Partial Information Decomposition (PID) framework for dissecting how Large Vision-Language Models (LVLMs) process information, quantifying the redundant, unique, and synergistic contributions of visual and linguistic inputs. Applying the framework to 26 LVLMs across four datasets, model layers, and training stages reveals two distinct task regimes (synergy-driven vs. knowledge-driven) and two family-level strategies (fusion-centric vs. language-centric). The analysis also identifies visual instruction tuning as the stage where multimodal fusion is learned, giving a more nuanced picture of LVLM decision-making than accuracy metrics alone.

Key Contribution

Not all LVLMs fuse modalities equally: a new information-theoretic analysis reveals that some lean heavily on language priors while others genuinely fuse vision and language.

Abstract

Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine whether their success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs -- decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions -- breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis .
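To make the redundant / unique / synergistic split concrete, the following is a minimal sketch of a two-source PID on a toy discrete distribution. It is not the scalable estimator the abstract refers to: it swaps in the classic Williams-Beer I_min redundancy measure, which is only practical for small discrete variables. The function names (`marginal`, `mutual_info`, `specific_info`, `pid_imin`) and the toy distributions are hypothetical, and reading X1 as "vision" and X2 as "language" is purely illustrative.

```python
import math
from collections import defaultdict

# Joint pmf format (an assumption of this sketch): {(x1, x2, y): probability}.

def marginal(joint, idx):
    """Marginalize the joint pmf onto the given index positions."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[tuple(key[i] for i in idx)] += p
    return out

def mutual_info(joint, a_idx, b_idx):
    """Mutual information I(A; B) in bits between two groups of variables."""
    pa, pb = marginal(joint, a_idx), marginal(joint, b_idx)
    pab = marginal(joint, a_idx + b_idx)
    total = 0.0
    for key, p in pab.items():
        if p > 0:
            a, b = key[:len(a_idx)], key[len(a_idx):]
            total += p * math.log2(p / (pa[a] * pb[b]))
    return total

def specific_info(joint, src_idx, y):
    """Specific information I(Y=y; X) = sum_x p(x|y) * log2(p(y|x) / p(y))."""
    py = marginal(joint, (2,))[(y,)]
    px = marginal(joint, src_idx)
    pxy = marginal(joint, src_idx + (2,))
    total = 0.0
    for key, p in pxy.items():
        if key[-1] == y and p > 0:
            x = key[:-1]
            total += (p / py) * math.log2((p / px[x]) / py)
    return total

def pid_imin(joint):
    """Two-source PID using Williams-Beer I_min as the redundancy term."""
    py = marginal(joint, (2,))
    redundancy = sum(
        p * min(specific_info(joint, (0,), y), specific_info(joint, (1,), y))
        for (y,), p in py.items() if p > 0
    )
    i1 = mutual_info(joint, (0,), (2,))      # I(X1; Y) = R + U1
    i2 = mutual_info(joint, (1,), (2,))      # I(X2; Y) = R + U2
    i12 = mutual_info(joint, (0, 1), (2,))   # I(X1, X2; Y) = R + U1 + U2 + S
    unique_1, unique_2 = i1 - redundancy, i2 - redundancy
    synergy = i12 - redundancy - unique_1 - unique_2
    return {"redundancy": redundancy, "unique_1": unique_1,
            "unique_2": unique_2, "synergy": synergy}

# Toy case 1: Y = X1 XOR X2. Neither input alone predicts Y.
xor = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
# Toy case 2: Y = X1 = X2. Both inputs carry the same bit.
copy = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

print(pid_imin(xor))
print(pid_imin(copy))
```

In the XOR case the full bit of information about Y is synergistic, the analogue of a task that cannot be solved from either modality alone; in the copy case the bit is entirely redundant, the analogue of a task where one modality's prior already suffices.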

Interpretability & Mechanistic Interp, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
