This paper introduces EVA (LatEnt Visual StAtes), a framework that generates continuous latent visual representations for multimodal reasoning instead of relying on discrete outputs that invoke external tools. Adaptive Latent_slot tokens serve as intermediate visual thoughts and are trained end-to-end with the discrete text tokens. Because this co-optimization causes policy deviation in the window that follows the Latent_slot tokens, the authors introduce D-GSPO, which decouples the optimization of the latent and discrete components. Experiments across multiple benchmarks show performance gains together with improved inference efficiency.
Continuous latent visual representations can improve both the performance and the inference efficiency of multimodal reasoning compared with methods that rely on discrete outputs.
The integration of visual evidence has significantly enhanced the capabilities of large multimodal models. However, this integration predominantly relies on generating discrete outputs (e.g., code or box coordinates) to invoke external tools, a process that introduces rigid dependencies and substantial latency. To overcome these limitations, we propose EVA (LatEnt Visual StAtes), a novel framework that natively generates continuous latent visual representations. These internal representations manifest as an adaptive sequence of Latent_slot tokens, which serve as intermediate visual thoughts during the reasoning process and are trained end-to-end with the discrete text tokens. Notably, this co-optimization causes extreme policy deviation in the "transition window" following the Latent_slot tokens. We develop D-GSPO (Decouple-GSPO) to target this root cause by decoupling the optimization of the latent and discrete components. To support SFT, we construct EVA-230K, a high-quality text-image interleaved CoT dataset encompassing a diverse range of real-world scenes, documents, charts, and OCR tasks. Extensive experiments across multiple benchmarks confirm that EVA achieves significant performance gains while enhancing inference efficiency.