This paper introduces an information-theoretic probing framework, based on entropy analysis, to diagnose why unified multimodal models (UMMs) fail to combine LLM reasoning with vision-model generation. An analysis of ten UMMs reveals "pseudo-unification" arising from two divergences: modality-asymmetric encoding (vision and language follow different entropy trajectories) and pattern-split response (high-entropy text generation vs. low-entropy image synthesis). The study argues that genuine multimodal synergy requires consistent information flow: models that unify both encoding and response (e.g., via contextual prediction) achieve stronger reasoning-based text-to-image generation.
Unified multimodal models aren't truly unified: vision and language modalities exhibit divergent entropy patterns during encoding and generation, hindering effective reasoning-based image synthesis.
Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.
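To make the "Pattern-Split Response" idea concrete, the sketch below computes the average Shannon entropy of per-step output distributions for a text response and an image-token response, then compares them. It is only an illustration under stated assumptions: the abstract does not specify the paper's entropy estimator, so the softmax-over-logits formulation, the function names (`softmax`, `shannon_entropy`, `mean_step_entropy`), and the toy logits are all hypothetical and are not taken from the paper or from any real model.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mean_step_entropy(step_logits):
    """Average the per-step entropy over one generated sequence."""
    entropies = [shannon_entropy(softmax(logits)) for logits in step_logits]
    return sum(entropies) / len(entropies)


# Toy logits (assumption, not real model output): two generation steps
# over a 4-symbol vocabulary for each modality.
text_steps = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, -0.1, 0.0]]     # flat: many plausible tokens
image_steps = [[10.0, 0.0, 0.0, 0.0], [0.0, 12.0, 0.0, 0.0]]   # peaked: one dominant token

text_h = mean_step_entropy(text_steps)
image_h = mean_step_entropy(image_steps)
gap = text_h - image_h  # a large positive gap would indicate a split response pattern

print(f"text entropy:  {text_h:.3f} bits")
print(f"image entropy: {image_h:.3f} bits")
print(f"gap:           {gap:.3f} bits")
```

In this toy setup the text distributions are nearly uniform and the image distributions are sharply peaked, so the gap is large. Under the abstract's framing, a model closer to genuine unification would be expected to show more consistent entropy behavior across the two modalities.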