Huawei, PolyU, SYSU, Yinwang Intelligent Technology Co. Ltd

Jun 23, 2026

arXiv:2606.24233

Latent Visual States for Efficient Multimodal Reasoning

Xiuwei Chen, Wentao Hu, Yongxin Wang, Zisheng Chen, Likui Zhang, Kun Xiang, Jianhua Han, Hui-Ling Zhen, Jingyuan Zou, Hang Xu, Xiaodan Liang

AI Summary

This paper introduces EVA (LatEnt Visual StAtes), a framework that generates continuous latent visual representations to improve multimodal reasoning by replacing the traditional reliance on discrete outputs. By employing adaptive Latent_slot tokens that serve as intermediate visual thoughts, the authors achieve co-optimization with discrete text tokens, leading to notable performance improvements and reduced latency. The introduction of D-GSPO addresses policy deviations during the reasoning process, and extensive experiments demonstrate EVA's effectiveness across various benchmarks, highlighting its potential for enhanced inference efficiency.

Key Contribution

Natively generating continuous latent visual representations, rather than discrete outputs that invoke external tools, improves both multimodal reasoning performance and inference efficiency.

Abstract

The integration of visual evidence has significantly enhanced the capabilities of large multimodal models. However, this integration predominantly relies on generating discrete outputs (e.g., code or box coordinates) to invoke external tools, a process that introduces rigid dependencies and substantial latency. To overcome these limitations, we propose EVA (LatEnt Visual StAtes), a novel framework that natively generates continuous latent visual representations. These internal representations manifest as an adaptive sequence of Latent_slot tokens, serving as intermediate visual thoughts during the reasoning process. The Latent_slot tokens are then trained end-to-end with the discrete text tokens. Notably, this co-optimization causes extreme policy deviation in the 'transition window' following the Latent_slot tokens. We develop D-GSPO (Decouple-GSPO) to target this root cause by decoupling the optimization of latent and discrete components. To support SFT, we construct EVA-230K, a high-quality text-image interleaved CoT dataset encompassing a diverse range of real-world scenes, documents, charts, and OCR tasks. Extensive experiments across multiple benchmarks confirm that EVA achieves significant performance gains while enhancing inference efficiency.

Multimodal Models, Tool Use & Agents


Year: 2026

Venue: N/A
