The paper introduces MindDriver, a progressive multimodal reasoning framework for autonomous driving that leverages vision-language models (VLMs) to bridge the gap between semantic understanding and physical trajectory planning. It addresses limitations of Chain-of-Thought (CoT) approaches by incorporating future image prediction with planning-oriented objective guidance. The framework is trained using a feedback-guided data annotation pipeline and a progressive reinforcement fine-tuning method, demonstrating improved performance in both open-loop and closed-loop driving evaluations.
Autonomous driving gets a human-like reasoning boost: MindDriver uses progressive multimodal reasoning to bridge the gap between semantic understanding and physical trajectory planning.
Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), the reasoning strategy most widely used with VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the semantic space of text and the physical space of trajectories. Although a recent approach replaces text with future images as the CoT process, it lacks clear planning-oriented objective guidance for generating images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables a VLM to imitate human-like progressive thinking for autonomous driving. MindDriver reasons progressively through semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To align the reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline that generates aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method that optimizes the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Code is available at https://github.com/hotdogcheesewhite/MindDriver.