The paper investigates why multimodal LLMs fail to fully use non-textual information, even when that information is demonstrably present in intermediate layers. It argues that the decoder, trained primarily on text, is the bottleneck because it can only extract information aligned with its text-centric training distribution. The authors formalize this limitation with a Generalized Mutual Information (GMI) bound and validate it empirically across multiple models and modalities, showing that either improving text-alignment in the encoder or directly training the decoder to use a specific modality improves performance.
MLLMs are blind and deaf not because they can't see or hear, but because their text-trained decoders ignore most of what the encoders pass along.
Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.
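The claim that accessible information is bounded by the GMI can be made concrete on a toy discrete channel. The sketch below is not the authors' code or experimental setup. It assumes the standard GMI definition from the mismatched-decoding literature, $\sup_{s \ge 0} \mathbb{E}\left[\log \frac{q(X,Y)^s}{\sum_{x'} P(x')\, q(x',Y)^s}\right]$, and approximates the supremum with a grid search over $s$. The 3-symbol channel, the mismatched scoring rule, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def mutual_information(p_x, channel):
    """I(X;Y) in nats, for input distribution p_x and channel[x, y] = P(y | x).

    Assumes every entry of `channel` is strictly positive.
    """
    joint = p_x[:, None] * channel            # P(x, y)
    p_y = joint.sum(axis=0)                   # P(y)
    return float(np.sum(joint * np.log(channel / p_y[None, :])))


def gmi(p_x, channel, metric, s_grid=None):
    """Generalized mutual information in nats for a decoder scoring with metric[x, y].

    The supremum over s >= 0 is approximated on a grid (an illustrative shortcut).
    Assumes every entry of `metric` is strictly positive.
    """
    if s_grid is None:
        s_grid = np.linspace(0.0, 5.0, 501)
    joint = p_x[:, None] * channel            # expectation is under the TRUE joint
    best = 0.0                                # s = 0 always yields exactly 0
    for s in s_grid:
        q_s = metric ** s
        denom = (p_x[:, None] * q_s).sum(axis=0)   # sum_x' P(x') q(x', y)^s
        value = float(np.sum(joint * np.log(q_s / denom[None, :])))
        best = max(best, value)
    return best


# Toy channel: three inputs, each mostly mapped to its own output symbol.
p_x = np.full(3, 1.0 / 3.0)
channel = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# Hypothetical mismatched scoring rule: the decoder separates input 0 from the
# rest but scores inputs 1 and 2 identically, so it has no "direction" along
# which to tell them apart, even though the channel output does.
mismatched_metric = np.array([
    [0.8, 0.10, 0.10],
    [0.1, 0.45, 0.45],
    [0.1, 0.45, 0.45],
])

mi = mutual_information(p_x, channel)
gmi_matched = gmi(p_x, channel, channel)
gmi_mismatched = gmi(p_x, channel, mismatched_metric)

print(f"I(X;Y)           = {mi:.4f} nats")
print(f"GMI (matched)    = {gmi_matched:.4f} nats")
print(f"GMI (mismatched) = {gmi_mismatched:.4f} nats")
```

When the scoring rule equals the true channel, the GMI recovers the full mutual information. With the mismatched rule it falls strictly below it on this toy channel, although the information distinguishing inputs 1 and 2 is still present in the output. This mirrors, in miniature, the abstract's point that the limit comes from the decoder's scoring rule rather than from what the encoder preserves.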