IIT Delhi | Apr 22, 2026 | arXiv:2604.20665

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

AI Summary

This paper critiques the current Vision Encoder-Projector-LLM paradigm, arguing that VLMs often bypass visual representation bottlenecks by relying on language priors, leading to "functional blindness." To address this, the authors introduce the Modality Translation Protocol, an information-theoretic approach that quantifies the "Expense of Seeing" through three metrics: the Toll, Curse, and Fallacy of Seeing. They also posit a "Divergence Law of Multimodal Scaling," hypothesizing that the penalty of the visual knowledge bottleneck increases as the language model scales, and they advocate a shift towards architectures that prioritize genuine multimodal reasoning.

Key Contribution

VLMs are often functionally blind, exploiting language priors instead of truly "seeing" visual data, and the paper hypothesizes that this problem paradoxically *worsens* as language models scale.

Abstract

The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery, but its foundation rests on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., they exploit strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data and achieve true multimodal reasoning.

Eval Frameworks & Benchmarks | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 13

Year: 2026

Venue: N/A
