The paper introduces HyLaR, a framework that interleaves discrete text generation with continuous visual latent representations for multimodal reasoning. To optimize the resulting hybrid discrete-continuous action space, the authors propose DePO (Decoupled Policy Optimization), which decomposes the policy-gradient objective, applies independent trust-region constraints to the textual and latent components, and adds a closed-form von Mises-Fisher KL regularizer. Experiments show that HyLaR outperforms standard MLLMs and other latent reasoning methods on fine-grained perception and multimodal understanding tasks.
Unleashing the full potential of multimodal LLMs requires reasoning directly in the visual latent space, and this paper shows how to do it with stable policy optimization.
Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically requires discretizing visual signals to fit LLM inputs, which causes early semantic collapse and discards fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT) stage, we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.
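The abstract does not spell out DePO's exact objective, but the ingredients it names are separable: a trust-region-style surrogate for the discrete text tokens, a second surrogate with its own clip range for the continuous latent steps, and a closed-form KL penalty between von Mises-Fisher distributions over latent directions. The sketch below illustrates those pieces under stated assumptions: the function names (`vmf_kl`, `depo_style_loss`), the clip ranges, and the penalty weight `beta` are illustrative choices rather than the paper's implementation, and the KL term uses the standard closed form for two vMF densities on the unit sphere.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function: I_nu(x) * exp(-x)

def _log_vmf_const(kappa, p):
    """log C_p(kappa) for the vMF density C_p(kappa) * exp(kappa * mu^T x) on the sphere S^{p-1}."""
    nu = p / 2.0 - 1.0
    # log I_nu(kappa) = log ive(nu, kappa) + kappa
    return nu * np.log(kappa) - (p / 2.0) * np.log(2.0 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def vmf_kl(mu1, kappa1, mu2, kappa2):
    """Closed-form KL( vMF(mu1, kappa1) || vMF(mu2, kappa2) ) for unit vectors mu1, mu2 in R^p."""
    p = mu1.shape[-1]
    # Mean resultant length A_p(kappa) = I_{p/2}(kappa) / I_{p/2-1}(kappa)
    a1 = ive(p / 2.0, kappa1) / ive(p / 2.0 - 1.0, kappa1)
    return (_log_vmf_const(kappa1, p) - _log_vmf_const(kappa2, p)
            + a1 * (kappa1 - kappa2 * float(mu1 @ mu2)))

def clipped_surrogate(logp_new, logp_old, adv, eps):
    """PPO-style clipped objective for one action type (text tokens or latent steps)."""
    ratio = np.exp(logp_new - logp_old)
    return np.mean(np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv))

def depo_style_loss(logp_txt_new, logp_txt_old, adv_txt,
                    logp_lat_new, logp_lat_old, adv_lat,
                    kl_latent, eps_txt=0.2, eps_lat=0.1, beta=0.05):
    """Decoupled surrogate: independent clip ranges (trust regions) for the textual and
    latent components, plus a weighted vMF KL penalty on the latent policy.
    Illustrative only; the paper's exact objective is not given in the abstract."""
    loss_txt = -clipped_surrogate(logp_txt_new, logp_txt_old, adv_txt, eps_txt)
    loss_lat = -clipped_surrogate(logp_lat_new, logp_lat_old, adv_lat, eps_lat)
    return loss_txt + loss_lat + beta * kl_latent

# Example: KL between two latent-direction policies in a 4-dimensional latent space.
mu_new = np.array([1.0, 0.0, 0.0, 0.0])
mu_ref = np.array([0.0, 1.0, 0.0, 0.0])
print(vmf_kl(mu_new, 10.0, mu_ref, 10.0))
```

A vMF distribution is a natural fit if the visual latents are unit-normalized, since it is defined on the hypersphere and, unlike most continuous policies, its KL divergence admits the exact closed form above (a log-ratio of normalizing constants plus a mean-resultant-length term), which is presumably what makes an "exact closed-form vMF KL regularizer" tractable during training.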