Purdue, UW-Madison | Feb 11, 2026 | arXiv:2602.10458

Found-RL: foundation model-enhanced reinforcement learning for autonomous driving

Yansong Qu, Zihao Sheng, Zilin Huang, Jiancong Chen, Yuhao Luo, Tianyi Wang, Yiheng Feng, Samuel Labi, Sikai Chen

AI Summary

The paper introduces Found-RL, a platform that enhances reinforcement learning for autonomous driving with vision-language models (VLMs) while addressing their high inference latency. Found-RL uses an asynchronous batch inference framework to decouple VLM reasoning from the RL simulation loop, enabling real-time learning. It distills VLM action suggestions into the RL policy through Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG), and shapes dense rewards with CLIP via Conditional Contrastive Action Alignment. The resulting lightweight RL model achieves near-VLM performance while running at approximately 500 FPS.

Key Contribution

Achieves near-VLM autonomous driving performance with a lightweight RL model running at approximately 500 FPS by asynchronously distilling knowledge from VLMs.

Abstract

Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision-Language Models (VLMs), can mitigate this by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real-time learning. We introduce diverse supervision mechanisms, Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG), to distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP's dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can approach the performance of billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at https://github.com/ys-qu/found-rl.
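
The asynchronous batch inference idea described in the abstract can be illustrated with a minimal Python sketch. This is not Found-RL's implementation: the class `AsyncBatchAdvisor`, the stand-in model `toy_vlm`, and parameters such as `max_batch` and `wait_s` are hypothetical names chosen for illustration. The sketch only shows the general pattern, assuming a slow batched model is served from a background thread so that the simulation loop never blocks on it.

```python
import queue
import threading
import time


class AsyncBatchAdvisor:
    """Hypothetical helper: serves a slow batched model from a background thread."""

    def __init__(self, slow_model, max_batch=4, wait_s=0.01):
        self._slow_model = slow_model  # callable: list of observations -> list of advice
        self._max_batch = max_batch
        self._wait_s = wait_s
        self._requests = queue.Queue()
        self._results = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def submit(self, step_id, obs):
        """Non-blocking: enqueue an observation from the simulation loop."""
        self._requests.put((step_id, obs))

    def poll(self):
        """Non-blocking: return every (step_id, advice) pair finished so far."""
        done = []
        while True:
            try:
                done.append(self._results.get_nowait())
            except queue.Empty:
                return done

    def flush(self):
        """Block until every submitted request has been answered."""
        self._requests.join()

    def close(self):
        self._stop.set()
        self._thread.join()

    def _worker(self):
        while not self._stop.is_set():
            try:
                batch = [self._requests.get(timeout=self._wait_s)]
            except queue.Empty:
                continue
            # Greedily collect whatever else is already waiting, up to max_batch.
            while len(batch) < self._max_batch:
                try:
                    batch.append(self._requests.get_nowait())
                except queue.Empty:
                    break
            ids, observations = zip(*batch)
            advice = self._slow_model(list(observations))
            for step_id, item in zip(ids, advice):
                self._results.put((step_id, item))
            for _ in batch:
                self._requests.task_done()


def toy_vlm(observations):
    """Stand-in for a slow VLM: maps a speed value to a coarse action label."""
    time.sleep(0.05)  # simulated inference latency
    return ["brake" if speed > 10.0 else "keep" for speed in observations]


def run_demo(num_steps=10):
    advisor = AsyncBatchAdvisor(toy_vlm, max_batch=4)
    labels = {}
    for step in range(num_steps):
        advisor.submit(step, float(step * 2))  # the simulation step itself never waits
        for step_id, advice in advisor.poll():
            labels[step_id] = advice           # late advice is attached by step id
    advisor.flush()
    for step_id, advice in advisor.poll():
        labels[step_id] = advice
    advisor.close()
    return labels
```

In this sketch the advice arrives some steps after the observation that produced it, so results are keyed by `step_id`. A training setup built this way would presumably attach late advice to stored transitions instead of to the live step, but how Found-RL handles this is not stated in the abstract.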

Multimodal Models | Robotics & Embodied AI | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 21

Year: 2026

Venue: N/A
