The paper introduces IRL-VLA, a novel framework for training Vision-Language-Action (VLA) models for autonomous driving that addresses the limitations of open-loop imitation learning and the reliance on high-fidelity simulation. IRL-VLA employs a three-stage approach: pretraining a VLA policy via imitation learning, constructing a lightweight reward world model via inverse reinforcement learning, and refining the policy with reward-world-model-guided reinforcement learning using PPO. The method achieves state-of-the-art performance on the NAVSIM v2 benchmark and secured 1st runner-up in the CVPR 2025 Autonomous Grand Challenge, demonstrating its effectiveness in balancing safety, comfort, and efficiency in autonomous driving.
Ditch the high-fidelity simulator: IRL-VLA uses a lightweight reward world model trained with inverse reinforcement learning to enable efficient and effective closed-loop RL training for autonomous driving.
Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to merely reproduce the recorded behaviors in the dataset, leading to suboptimal and constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning approach built on an Inverse Reinforcement Learning reward world model and a self-built VLA policy. Our framework proceeds in a three-stage paradigm: in the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. Finally, to further enhance planning performance, we design reward-world-model-guided reinforcement learning via PPO (Proximal Policy Optimization) to effectively balance safety, driving comfort, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and secured 1st runner-up in the CVPR 2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
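The third stage described above can be sketched in miniature. The snippet below is a minimal illustration, not the paper's implementation: `reward_world_model` stands in for the learned IRL reward (here a hypothetical fixed weighting of safety, comfort, and efficiency scores), and `ppo_clip_objective` is the standard PPO clipped surrogate that would be maximized when refining the pretrained policy; all names and weights are illustrative assumptions.

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    new_logp / old_logp: per-action log-probabilities under the current
    policy and the behavior (e.g. stage-1 pretrained) policy.
    advantages: advantage estimates, here assumed to come from rewards
    produced by the reward world model rather than a simulator.
    """
    ratio = np.exp(new_logp - old_logp)          # importance ratio
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # PPO takes the elementwise minimum of the clipped and unclipped
    # surrogates, which removes the incentive for large policy updates.
    return float(np.mean(np.minimum(ratio * advantages,
                                    clipped * advantages)))

def reward_world_model(traj_scores, weights=(0.5, 0.3, 0.2)):
    """Toy stand-in for the learned reward world model: combines
    per-trajectory safety, comfort, and efficiency scores (weights
    are an assumption for illustration, not from the paper)."""
    safety, comfort, efficiency = traj_scores
    return weights[0] * safety + weights[1] * comfort + weights[2] * efficiency
```

With identical old and new log-probabilities the ratio is 1 everywhere, so the objective reduces to the mean advantage; the clipping only activates once the refined policy drifts from the pretrained one.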