This paper introduces CriticVLA, a two-stage framework for autonomous driving that uses Vision Language Action (VLA) models both to generate a trajectory and to refine it. The key innovation is using the VLA as a critic that evaluates and optimizes the initial trajectory, with the critic supported by a newly constructed synthetic dataset of 12.9 million annotated trajectories. Closed-loop experiments on Bench2Drive show a 73.33% total success rate and about a 30% improvement in challenging scenarios compared to existing VLA-based methods.
Autonomous driving gets a roughly 30% performance boost in challenging scenarios by having VLAs critique and refine their own driving plans.
Recent advances in vision language action (VLA) models have shown remarkable potential for autonomous driving by directly mapping multimodal inputs to control signals. However, previous VLA-based methods have not explicitly exploited the critic capability of VLAs to refine driving decisions, even though such capability has been well demonstrated in other LLM-based domains, thereby limiting their performance in complex closed-loop scenarios. In this work, we present a theoretically inspired two-stage framework, CriticVLA, which extends the role of VLAs from acting to judging. CriticVLA first generates a rough trajectory and then refines it through multimodal evaluation and single-step optimization guided by a VLA-based critic, yielding higher-quality driving behaviors. To support this process, we construct a large-scale synthetic dataset of 12.9 million annotated trajectories covering diverse driving scenarios, which enhances the critic's reasoning and refinement abilities. Extensive closed-loop experiments on the Bench2Drive benchmark show that CriticVLA significantly surpasses state-of-the-art baselines, achieving a 73.33% total success rate and delivering about 30% improvement in challenging scenarios.
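The generate-then-refine loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not specify the critic's output format or the form of the single-step optimization, so a hand-written geometric clearance cost stands in for the VLA-based critic, a straight line stands in for the rough trajectory, and the names `rough_trajectory`, `toy_critic`, and `refine_once` are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def rough_trajectory(start: Point, goal: Point, steps: int) -> Trajectory:
    """Stage 1 stand-in (assumption): a straight line from start to goal."""
    return [
        (start[0] + (goal[0] - start[0]) * t / steps,
         start[1] + (goal[1] - start[1]) * t / steps)
        for t in range(steps + 1)
    ]


def toy_critic(traj: Trajectory, obstacle: Point,
               safe_radius: float) -> Tuple[float, List[Point]]:
    """Critic stand-in (assumption): a clearance cost, not a VLA.

    Returns a score (higher is better) and, for each waypoint, the
    direction that reduces the clearance penalty (its negative gradient).
    """
    penalty = 0.0
    corrections: List[Point] = []
    for x, y in traj:
        dx, dy = x - obstacle[0], y - obstacle[1]
        dist = math.hypot(dx, dy)
        violation = max(0.0, safe_radius - dist)
        penalty += violation ** 2
        # Unit vector pointing away from the obstacle; arbitrary if on top of it.
        ux, uy = (0.0, 1.0) if dist < 1e-9 else (dx / dist, dy / dist)
        corrections.append((2.0 * violation * ux, 2.0 * violation * uy))
    return -penalty, corrections


def refine_once(traj: Trajectory, corrections: List[Point],
                step_size: float) -> Trajectory:
    """Stage 2 stand-in: one update step along the critic's corrections."""
    refined = [
        (x + step_size * cx, y + step_size * cy)
        for (x, y), (cx, cy) in zip(traj, corrections)
    ]
    # Keep the start and goal fixed so the plan still connects them.
    refined[0], refined[-1] = traj[0], traj[-1]
    return refined


obstacle, safe_radius = (5.0, 0.5), 2.0
rough = rough_trajectory((0.0, 0.0), (10.0, 0.0), steps=10)
rough_score, corrections = toy_critic(rough, obstacle, safe_radius)
refined = refine_once(rough, corrections, step_size=0.5)
refined_score, _ = toy_critic(refined, obstacle, safe_radius)
print(f"critic score: rough={rough_score:.3f}, refined={refined_score:.3f}")
```

In this toy setting a single step is enough to push the violating waypoints out to the safety radius, which loosely mirrors the idea of a critic-guided single-step refinement. A learned critic would replace the geometric cost with a model that scores trajectories from multimodal inputs.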