May 22, 2026 · arXiv:2605.23163

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

Kewei Zhang, Sensen Gao, Yulong Cao, Song Han, B. Ivanovic, Langechuan Liu, Marco Pavone, Daquan Zhou, Enze Xie

AI Summary

Fast-dDrive, a block-diffusion Vision-Language-Action model, addresses the limitations of autoregressive and full-sequence diffusion models in autonomous driving by performing bidirectional refinement within semantic units while enforcing causal ordering across them. It exploits the structured, JSON-like nature of driving VLA outputs by freezing structural tokens into a scaffold and employing a section-aware training recipe. Scaffold Speculative Decoding raises throughput, and a test-time scaling scheme that averages multiple rollouts suppresses prediction variance. The model reports state-of-the-art ADE@3s and ADE@5s on WOD-E2E and an average L2 error of 0.32 m on nuScenes.
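The attention pattern implied by "bidirectional within semantic units, causal across them" can be sketched as a block-causal mask. This is a minimal illustration, not the paper's implementation: the function name `block_causal_mask` and the block sizes are hypothetical, and it assumes each semantic unit is a contiguous run of tokens.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Return mask where mask[i][j] is True if token i may attend to token j.

    Tokens attend bidirectionally inside their own block and causally
    (only backwards) across blocks. Block sizes here are illustrative.
    """
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # Token i sees token j when j's block is the same as, or earlier than, i's.
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Two blocks: a 2-token unit followed by a 3-token unit.
mask = block_causal_mask([2, 3])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this toy case the first two rows print `11000` and the last three print `11111`: tokens in the first block see each other but nothing later, while tokens in the second block see everything up to the end of their own block.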

Key Contribution

By structuring diffusion-based driving models around a "scaffold" of frozen structural tokens, Fast-dDrive delivers a 12x throughput speedup over the autoregressive baseline (when integrated with SGLang) while improving trajectory accuracy.

Abstract

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking $N$ stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to $0.32$ m (a $22\%$ improvement). When integrated with SGLang, our framework delivers $12\times$ throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
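The averaging step of the test-time scaling scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the name `average_rollouts` and the `(x, y)` waypoint representation are hypothetical, and the shared-prefix KV cache that makes the $N$ rollouts cheap is not modeled here. If the rollouts were independent with per-coordinate variance $\sigma^2$, their mean would have variance $\sigma^2 / N$; rollouts forked from one prefix may be correlated, so the actual reduction could be smaller.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]  # assumed (x, y) position in metres


def average_rollouts(rollouts: List[List[Waypoint]]) -> List[Waypoint]:
    """Average N trajectories waypoint by waypoint.

    Each rollout stands in for one stochastic decode of the planning
    section; all rollouts must cover the same number of timesteps.
    """
    if not rollouts:
        raise ValueError("need at least one rollout")
    horizon = len(rollouts[0])
    if any(len(r) != horizon for r in rollouts):
        raise ValueError("all rollouts must share the same horizon")
    n = len(rollouts)
    return [
        (
            sum(r[t][0] for r in rollouts) / n,  # mean x at timestep t
            sum(r[t][1] for r in rollouts) / n,  # mean y at timestep t
        )
        for t in range(horizon)
    ]


# Two toy rollouts with two waypoints each.
averaged = average_rollouts([
    [(0.0, 0.0), (1.0, 2.0)],
    [(2.0, 0.0), (3.0, 4.0)],
])
print(averaged)  # [(1.0, 0.0), (2.0, 3.0)]
```

A plain arithmetic mean is used because the abstract describes averaging the rollouts; weighting or outlier rejection would be a different design and is not implied by the source.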

Inference & Quantization · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
