Apr 16, 2026 · arXiv:2604.15308

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang

AI Summary

RAD-2 is a generator-discriminator framework for closed-loop autonomous driving planning: a diffusion-based generator proposes trajectory candidates, and an RL-optimized discriminator reranks them by long-term driving quality. The method adds Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to ease credit assignment, and On-policy Generator Optimization, which shifts the generator toward high-reward trajectories. Training is supported by BEV-Warp, a new high-throughput simulation environment. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners, and real-world deployment shows improved perceived safety and driving smoothness.

Key Contribution

RAD-2 decouples trajectory generation from trajectory evaluation, which avoids applying sparse scalar rewards directly to the high-dimensional trajectory space and improves optimization stability. The authors report a 56% lower collision rate than strong diffusion-based planners, along with improved perceived safety and driving smoothness in complex urban traffic.

Abstract

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
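The generate-then-rerank design described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `sample_candidates` is a random stand-in for the diffusion-based generator, `toy_score` is a hand-written stand-in for the learned discriminator, and `group_relative_advantages` shows the standard group-normalized advantage used in Group Relative Policy Optimization, not the temporally consistent variant the paper proposes (whose details the abstract does not give).

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

# A trajectory is a list of (x, y) waypoints; this representation is an assumption.
Trajectory = List[Tuple[float, float]]


def sample_candidates(num_candidates: int, horizon: int,
                      rng: random.Random) -> List[Trajectory]:
    """Hypothetical stand-in for the generator: straight-line rollouts
    with a random forward speed and a random lateral drift."""
    candidates = []
    for _ in range(num_candidates):
        speed = rng.uniform(0.5, 2.0)
        lateral_rate = rng.uniform(-0.5, 0.5)
        candidates.append(
            [(speed * t, lateral_rate * t) for t in range(1, horizon + 1)]
        )
    return candidates


def toy_score(trajectory: Trajectory) -> float:
    """Hypothetical stand-in for the discriminator: reward forward
    progress and penalize lateral deviation at the final waypoint."""
    final_x, final_y = trajectory[-1]
    return final_x - 2.0 * abs(final_y)


def rerank(candidates: Sequence[Trajectory],
           score_fn: Callable[[Trajectory], float]) -> Tuple[int, List[float]]:
    """Score every candidate and return the index of the best one
    together with all scores."""
    scores = [score_fn(c) for c in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return best_index, scores


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Standard group-relative advantage: each reward is centered by the
    group mean and scaled by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = sample_candidates(num_candidates=8, horizon=5, rng=rng)
    best_index, scores = rerank(candidates, toy_score)
    print("selected candidate:", best_index)
    print("advantages:", group_relative_advantages(scores))
```

The point of the sketch is the separation of roles: the scalar score only ranks complete candidates, so the reward signal never has to be pushed directly through the high-dimensional trajectory output of the generator.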

RLHF & Preference Learning · Robotics & Embodied AI · Training Efficiency & Optimization · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 65

Year: 2026

Venue: N/A
