Apr 30, 2026 · arXiv:2604.27366

Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving

Lijin Yang, Jian-Zhang Huang, Jianing Huang, Zhongzhan Huang, Shu Liu, Hao Yang

AI Summary

This paper introduces CriticVLA, a two-stage framework for autonomous driving that uses Vision Language Action (VLA) models both to generate a trajectory and to refine it. The key idea is to use the VLA as a critic: it evaluates the initial rough trajectory and guides a single-step optimization of it. The critic's reasoning and refinement abilities are supported by a newly constructed synthetic dataset of 12.9 million annotated trajectories. Closed-loop experiments on Bench2Drive report a 73.33% total success rate and about a 30% improvement in challenging scenarios, surpassing state-of-the-art baselines.

Key Contribution

CriticVLA extends the VLA from acting to judging: a VLA-based critic evaluates and refines an initial rough trajectory, yielding about a 30% improvement in challenging Bench2Drive scenarios.

Abstract

Recent advances in vision language action (VLA) models have shown remarkable potential for autonomous driving by directly mapping multimodal inputs to control signals. However, previous VLA-based methods have not explicitly exploited the critic capability of VLAs to refine driving decisions, even though such capability has been well demonstrated in other LLM-based domains, thereby limiting their performance in complex closed-loop scenarios. In this work, we present a theoretically inspired two-stage framework, CriticVLA, which extends the role of VLAs from acting to judging. CriticVLA first generates a rough trajectory and then refines it through multimodal evaluation and single-step optimization guided by a VLA-based critic, yielding higher-quality driving behaviors. To support this process, we construct a large-scale synthetic dataset of 12.9 million annotated trajectories covering diverse driving scenarios, which enhances the critic's reasoning and refinement abilities. Extensive closed-loop experiments on the Bench2Drive benchmark show that CriticVLA significantly surpasses state-of-the-art baselines, achieving a 73.33% total success rate and delivering about 30% improvement in challenging scenarios.
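The generate-then-refine loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: `generate_rough_trajectory`, `toy_critic`, and `refine_once` are hypothetical names, the critic is a hand-written obstacle-clearance score standing in for a VLA, and the "single-step optimization" is approximated here by one finite-difference gradient step on the lateral offset of each waypoint.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def generate_rough_trajectory(n_points: int = 5, spacing: float = 2.0) -> Trajectory:
    """Stage 1 placeholder (hypothetical): a straight line along the x axis."""
    return [(spacing * (i + 1), 0.0) for i in range(n_points)]


def toy_critic(traj: Trajectory,
               obstacle: Point = (6.0, 0.5),
               safe_radius: float = 2.0) -> float:
    """Hypothetical stand-in for a VLA critic; higher scores are better.

    Penalizes waypoints that come closer than safe_radius to one obstacle.
    """
    score = 0.0
    for x, y in traj:
        dist = math.hypot(x - obstacle[0], y - obstacle[1])
        score -= max(0.0, safe_radius - dist) ** 2
    return score


def refine_once(traj: Trajectory,
                critic: Callable[[Trajectory], float],
                step: float = 0.5,
                eps: float = 1e-4) -> Trajectory:
    """Stage 2 (assumed form): one gradient-ascent step on the critic score."""
    base = critic(traj)
    grads = []
    for i, (x, y) in enumerate(traj):
        shifted = list(traj)
        shifted[i] = (x, y + eps)  # perturb only the lateral coordinate
        grads.append((critic(shifted) - base) / eps)
    candidate = [(x, y + step * g) for (x, y), g in zip(traj, grads)]
    # Keep the refinement only if the critic judges it to be better.
    return candidate if critic(candidate) > base else traj


rough = generate_rough_trajectory()
refined = refine_once(rough, toy_critic)
print("rough score:  ", toy_critic(rough))
print("refined score:", toy_critic(refined))
```

In this sketch the critic plays two roles, scoring the candidate and supplying the direction of the update, which mirrors the "judge, then drive" idea only loosely; a real VLA critic would consume multimodal inputs rather than a single obstacle position.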

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
