This paper introduces DVGT-2, a Vision-Geometry-Action model for end-to-end autonomous driving that uses dense 3D geometry as a key representation. To enable online planning, DVGT-2 employs temporal causal attention and caches historical features within a sliding window, avoiding redundant computations. Experiments demonstrate that DVGT-2 achieves superior geometry reconstruction and can be directly applied to planning across diverse camera configurations without fine-tuning.
Ditch language descriptions: this new driving model leverages dense 3D geometry for superior autonomous driving performance and cross-camera generalization.
End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which learn language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. Because vehicles operate in a 3D world, we argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy that uses historical caches within a certain interval to avoid repetitive computations. Despite its faster inference, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, on both the closed-loop NAVSIM benchmark and the open-loop nuScenes benchmark.
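The combination of temporal causal attention and a sliding-window feature cache can be illustrated with a toy sketch. This is not the DVGT-2 implementation: the names `StreamingAttention`, `attend`, and `encode`, the identity "encoder", the single-vector frames, and the choice to count the current frame inside the window are all assumptions made for illustration. The sketch only shows the general idea that each frame is encoded once, its features are cached, and the current frame attends to a bounded set of past features.

```python
import math
from collections import deque


def attend(query, keys, values):
    """Scaled dot-product attention for one query over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class StreamingAttention:
    """Hypothetical streaming layer: causal attention over a sliding-window cache."""

    def __init__(self, window):
        # deque(maxlen=...) silently evicts the oldest frame once the window is full.
        self.cache = deque(maxlen=window)
        self.encode_calls = 0

    def encode(self, frame):
        # Stand-in for a real feature encoder (assumption): identity key and value.
        self.encode_calls += 1
        features = [float(x) for x in frame]
        return features, features

    def step(self, frame):
        # Encode only the new frame; earlier frames are reused from the cache.
        key, value = self.encode(frame)
        self.cache.append((key, value))
        # Causal by construction: the cache never contains future frames.
        keys = [k for k, _ in self.cache]
        values = [v for _, v in self.cache]
        return attend(key, keys, values)


model = StreamingAttention(window=3)
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
outputs = [model.step(f) for f in frames]
```

In this sketch the per-step cost depends on the window size rather than on the length of the full sequence, and `encode` runs exactly once per frame, which is the redundancy that batch reprocessing of multi-frame inputs would otherwise incur.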