DriveVLM-RL, a novel autonomous driving framework, integrates vision-language models (VLMs) into reinforcement learning (RL) via a dual-pathway architecture inspired by neuroscience. It uses a Static Pathway with CLIP for continuous spatial safety assessment and a Dynamic Pathway with a lightweight detector and VLM for attention-gated multi-frame semantic risk reasoning. The framework achieves significant improvements in collision avoidance, task success, and generalization in CARLA, while crucially removing all VLM components at deployment to ensure real-time feasibility.
Autonomous vehicles can now leverage the rich semantic understanding of VLMs for safer driving without the computational overhead, thanks to a clever training strategy that distills VLM knowledge into a real-time RL policy.
Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrasting language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: https://zilin-huang.github.io/DriveVLM-RL-website/
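The Static Pathway's "CLIP-based contrasting language goals" can be illustrated with a minimal sketch: score a frame's CLIP image embedding against a pair of opposing text prompts (e.g. a "safe" and an "unsafe" scene description) and turn the softmaxed similarities into a dense reward. The function below is a hypothetical illustration of that idea, not the paper's actual implementation; it operates on precomputed embeddings (in practice these would come from a CLIP image/text encoder), and the prompt wording, temperature, and reward scaling are all assumptions.

```python
import numpy as np

def clip_semantic_reward(image_emb, safe_text_emb, unsafe_text_emb, temperature=0.1):
    """Sketch of a contrastive language-goal reward (hypothetical API).

    Compares an image embedding against embeddings of two contrasting
    language goals and returns a reward in [-1, 1]: positive when the
    frame looks more like the 'safe' description, negative otherwise.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Cosine similarity to each language goal, sharpened by a temperature.
    logits = np.array([
        cosine(image_emb, safe_text_emb),
        cosine(image_emb, unsafe_text_emb),
    ]) / temperature

    # Softmax over the two goals (numerically stabilized).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Reward is the probability margin between the safe and unsafe goal.
    return probs[0] - probs[1]
```

Because this reward is computed from offline VLM/CLIP inference and only shapes training, it is consistent with the abstract's deployment story: the learned policy keeps none of these components at inference time.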