CAS · PKU · PolyU · May 6, 2026 · arXiv:2605.04647

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

Huimin Wang, Bihao Cui, Pengxiang Li, Ben Lu, Mingqian Wang, Tong Wang, Chuan Tang, Teng Zhang, Kun Zhan

AI Summary

ReflectDrive-2 is a discrete diffusion planner for autonomous driving that represents trajectory plans as discrete tokens and generates them through masked decoding. Its key component, AutoEdit, revises a trajectory in place by rewriting selected tokens with the same model. AutoEdit is trained in two stages: first with structure-aware perturbations of expert trajectories, then with reinforcement learning over the full rollout. On NAVSIM, reinforcement learning raises AutoEdit's PDMS gain from at most 0.3 to 1.9, and the planner achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, at 31.8 ms average latency on NVIDIA Thor.

Key Contribution

RL fine-tuning raises the PDMS gain from in-place trajectory editing from at most 0.3 to 1.9, roughly a 6x increase, showing the value of aligning the full draft-and-edit rollout of a diffusion planner with reinforcement learning.

Abstract

We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision--draft--reflect rollout with reinforcement learning (RL), assigning a terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most $0.3$, whereas RL increases its gain to $1.9$. We also co-design an efficient reflective decoding stack for the decision--draft--reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves $91.0$ PDMS with camera-only input and $94.8$ PDMS in a best-of-6 oracle setting, while running at $31.8$ ms average latency on NVIDIA Thor.
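
The first training stage can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the perturbations are constructed, so the code below assumes that a longitudinal perturbation rescales progress along the path and that a lateral heading perturbation rotates the trajectory about its first waypoint. It also operates on continuous (x, y) waypoints rather than the discrete trajectory tokens the paper uses. All function names and parameters (`perturb_longitudinal`, `perturb_lateral`, `make_recovery_pair`, `scale`, `dtheta`) are hypothetical.

```python
import math

def perturb_longitudinal(traj, scale):
    """Assumed longitudinal perturbation: stretch each step vector by `scale`,
    so the path shape is kept but progress along it changes."""
    out = [traj[0]]
    for (xa, ya), (xb, yb) in zip(traj, traj[1:]):
        px, py = out[-1]
        out.append((px + scale * (xb - xa), py + scale * (yb - ya)))
    return out

def perturb_lateral(traj, dtheta):
    """Assumed lateral heading perturbation: rotate the trajectory about
    its first waypoint by `dtheta` radians."""
    x0, y0 = traj[0]
    c, s = math.cos(dtheta), math.sin(dtheta)
    return [
        (x0 + c * (x - x0) - s * (y - y0), y0 + s * (x - x0) + c * (y - y0))
        for x, y in traj
    ]

def make_recovery_pair(expert, scale, dtheta):
    """Return (model input, supervision target): the corrupted trajectory
    and the original expert trajectory the model should recover."""
    corrupted = perturb_lateral(perturb_longitudinal(expert, scale), dtheta)
    return corrupted, expert

# Hypothetical straight-line expert trajectory with 1 m spacing.
expert = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
corrupted, target = make_recovery_pair(expert, scale=0.5, dtheta=math.pi / 2)
```

In this sketch the corrupted trajectory covers half the expert's distance and points 90 degrees to the left. A recovery objective would then ask the model to map `corrupted` back to `target`, after both have been tokenized.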

Architecture Design (Transformers, SSMs, MoE) · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
