Fast-dDrive, a novel block-diffusion Vision-Language-Action model, addresses the limitations of autoregressive and full-sequence diffusion models in autonomous driving by performing bidirectional refinement within semantic units while enforcing causal ordering across them. It exploits the structured nature of driving VLA outputs by freezing structural tokens into a scaffold and training with a section-aware recipe. Scaffold Speculative Decoding raises throughput and a test-time scaling scheme suppresses prediction variance; the model achieves state-of-the-art results on the WOD-E2E and nuScenes datasets.
By structuring diffusion-based driving models around a "scaffold" of frozen structural tokens, Fast-dDrive achieves a 12x speedup over autoregressive baselines while improving trajectory accuracy.
End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a delicate balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking $N$ stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we suppress prediction variance at a fraction of the computational cost. Empirical results demonstrate that Fast-dDrive advances the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves state-of-the-art ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to $0.32$m (a $22\%$ improvement). When integrated with SGLang, our framework delivers a $12\times$ throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
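Two of the ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the integer block ids, and the representation of a trajectory as a list of (x, y) waypoints are all assumptions made for illustration. The first function builds an attention mask that is bidirectional within a semantic unit and causal across units; the second averages $N$ rollouts waypoint by waypoint, as one plausible reading of the test-time scaling step.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]


def block_causal_mask(block_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    block_ids[k] is a hypothetical id of the semantic unit that token k
    belongs to, numbered in generation order (e.g. 0 = perception,
    1 = planning). Tokens see every token in their own block
    (bidirectional) and every token in earlier blocks (causal).
    """
    return [[bj <= bi for bj in block_ids] for bi in block_ids]


def average_rollouts(rollouts: List[List[Waypoint]]) -> List[Waypoint]:
    """Average N equal-length trajectories waypoint by waypoint.

    Assumes every rollout has the same number of (x, y) waypoints.
    """
    n = len(rollouts)
    horizon = len(rollouts[0])
    return [
        (
            sum(r[t][0] for r in rollouts) / n,
            sum(r[t][1] for r in rollouts) / n,
        )
        for t in range(horizon)
    ]


if __name__ == "__main__":
    # Two tokens in block 0 followed by two tokens in block 1.
    for row in block_causal_mask([0, 0, 1, 1]):
        print(["x" if allowed else "." for allowed in row])

    # Two hypothetical rollouts of two waypoints each.
    print(average_rollouts([
        [(0.0, 0.0), (1.0, 2.0)],
        [(2.0, 0.0), (3.0, 4.0)],
    ]))
```

In this sketch the earlier block never attends to the later one, which is the property that rules out planning tokens influencing perception tokens. The averaging step says nothing about how the rollouts share a KV cache; that part depends on the serving stack and is not modeled here.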