Apr 2, 2026 · arXiv:2604.02097

LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu, Pengfei Liu, Jun Zhu, Jun-Yan Zhu, Zhijie Deng

AI Summary

LatentUM, a novel unified model, is introduced to represent all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This shared latent space enables flexible interleaved cross-modal reasoning and generation, improving computational efficiency and cross-modal alignment. LatentUM achieves state-of-the-art performance on Visual Spatial Planning, enhances visual generation via self-reflection, and supports world modeling by predicting future visual states in the shared latent space.

Key Contribution

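By removing pixel-space translation between visual understanding and generation, LatentUM reasons across modalities in a single shared semantic latent space. It reports state-of-the-art results on the Visual Spatial Planning benchmark, along with improved computational efficiency and stronger cross-modal alignment.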

Abstract

Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 84

Year: 2026

Venue: N/A
