This paper introduces LiAuto-GeoX, an efficient grounded driving transformer for real-time, onboard dense 3D reconstruction in autonomous driving. A high-capacity geometry model is first trained on large-scale surround-view data with sparse LiDAR priors, then compressed into a compact 155-million-parameter onboard model through a geometry-preserving distillation framework that aims to retain geometric fidelity and surround-view consistency. Evaluations report 220 FPS on KITTI, and the learned geometry transfers to downstream tasks, reaching 90.6 PDMS in trajectory prediction and 24.63 mIoU in occupancy prediction, which positions dense 3D reconstruction as a foundational element for future autonomous driving systems.
Efficient dense 3D reconstruction can now serve as a scalable foundation for next-generation autonomous driving, achieving real-time performance without sacrificing fidelity.
Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present \textbf{LiAuto-GeoX}, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, using sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then distill this capability into a compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework combines two objectives: mask-guided depth-aware distillation, which retains fine-grained metric structure by emphasizing geometrically informative regions, and relative-pose relational distillation, which enforces cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations show that \textbf{LiAuto-GeoX} runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. Together, these results demonstrate that efficient dense 3D reconstruction can move beyond its traditional role as a perception target and serve as a scalable, foundational geometric representation for next-generation autonomous driving.
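The abstract names the two distillation objectives but does not give their exact form. The following is a minimal NumPy sketch of how such losses could look, not the authors' implementation. Every function name, the log-depth L1 penalty, the mask weighting, and the squared-error comparison of relative poses are assumptions made purely for illustration.

```python
import numpy as np


def masked_depth_distill_loss(student_depth, teacher_depth, mask, eps=1e-6):
    """Hypothetical mask-guided depth distillation term.

    Assumption: an L1 penalty on log-depth, averaged over pixels weighted
    by `mask` (larger weights mark geometrically informative regions).
    """
    diff = np.abs(np.log(student_depth + eps) - np.log(teacher_depth + eps))
    return float((mask * diff).sum() / (mask.sum() + eps))


def relative_poses(poses):
    """Pairwise relative transforms T_ij = inv(T_i) @ T_j.

    poses: array of shape (N, 4, 4) with camera-to-world matrices.
    Returns an array of shape (N, N, 4, 4).
    """
    inv = np.linalg.inv(poses)
    return inv[:, None] @ poses[None, :]


def relational_pose_loss(student_poses, teacher_poses):
    """Hypothetical relative-pose relational term.

    Assumption: mean squared error between student and teacher pairwise
    relative transforms, so only cross-view relations are compared.
    """
    rel_s = relative_poses(student_poses)
    rel_t = relative_poses(teacher_poses)
    return float(np.mean((rel_s - rel_t) ** 2))
```

Under these assumptions, the relational term is unchanged when all student poses are shifted by one global rigid transform, because only the relations between views are compared. That matches the stated goal of cross-view consistency rather than agreement in an absolute frame.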