HIT · Apr 30, 2026 · arXiv:2604.28192

LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

Hao Chen, Jiaming Liu, Zhonghao Yan, Nuowei Han, Renrui Zhang, Chenyang Gu, Jialin Gao, Ziyu Guo, Siyuan Qian, Yinxi Wang, Peng Jia, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng

AI Summary

LaST-R1 is a Vision-Language-Action (VLA) framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics with a tailored RL post-training paradigm to improve robotic manipulation. Its key component is Latent-to-Action Policy Optimization (LAPO), an RL algorithm that jointly optimizes latent reasoning and action generation. An adaptive latent CoT mechanism additionally lets the policy adjust its reasoning horizon to the complexity of the environment. Experiments on the LIBERO benchmark and in real-world deployments show a 99.8% average success rate in simulation and up to a 44% improvement over the initial warm-up policy across four real-world tasks.

Key Contribution

Forget static imitation learning: LaST-R1 reaches a near-perfect 99.8% average success rate on the LIBERO benchmark by adaptively reasoning about physical dynamics in latent space *before* acting, then refining both reasoning and action with RL.

Abstract

Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches share a critical limitation: whether employing explicit linguistic reasoning that suffers from latency and discretization, or utilizing more expressive continuous latent reasoning, they are predominantly confined to static imitation learning that limits adaptability and generalization. While online reinforcement learning (RL) has been introduced to VLAs to enable trial-and-error exploration, current methods exclusively optimize the vanilla action space, bypassing the underlying physical reasoning process. In this paper, we present LaST-R1, a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics prior to action execution, along with a tailored RL post-training paradigm. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a novel RL algorithm that jointly optimizes the latent reasoning process and the action generation. By bridging reasoning and control, LAPO improves the representation of physical world modeling and enhances robustness in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced to allow the policy to dynamically adjust its reasoning horizon based on environment complexity. Extensive experiments show that LaST-R1 achieves a near-perfect 99.8% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art methods. In real-world deployments, LAPO post-training yields up to a 44% improvement over the initial warm-up policy across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
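The abstract does not give the LAPO objective, so the following is only an illustrative sketch of what "jointly optimizing the latent reasoning process and the action generation" could look like. It swaps in a standard PPO-style clipped surrogate and assumes that each latent reasoning step and each action token has a tractable log-probability under the policy. The function name, its parameters, and the choice of a clipped surrogate are all assumptions made for illustration, not details taken from the paper.

```python
import math


def joint_clipped_surrogate(latent_logp_new, latent_logp_old,
                            action_logp_new, action_logp_old,
                            advantage, clip_eps=0.2):
    """Hypothetical PPO-style objective over a joint latent + action sample.

    Each *_logp_* argument is a list of per-step log-probabilities for the
    same sampled trajectory, scored under the new or the old policy.
    The latent lists may have any length, which loosely mirrors a
    variable (adaptive) reasoning horizon.
    """
    # Treat latent reasoning steps and action tokens as one sequence,
    # so the importance ratio covers both parts (assumption).
    logp_new = sum(latent_logp_new) + sum(action_logp_new)
    logp_old = sum(latent_logp_old) + sum(action_logp_old)
    ratio = math.exp(logp_new - logp_old)

    # Standard PPO clipping keeps the update close to the old policy.
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Example: two latent steps and one action token, positive advantage.
value = joint_clipped_surrogate(
    latent_logp_new=[-1.0, -1.0], latent_logp_old=[-1.5, -1.5],
    action_logp_new=[-0.5], action_logp_old=[-0.5],
    advantage=1.0,
)
```

Because the ratio is computed over the concatenated sequence, a gradient step on this objective would update the latent reasoning and the action output together, in contrast to the action-only RL optimization the abstract attributes to current methods.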

Multimodal Models · Reasoning & Chain-of-Thought · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
