College of Software Engineering, CUHK, Li Auto, NJU, NorthwesternPolyU, School of Computer Science and Engineering, School of Computing and Information, School of Cyber Science and Engineering, SEU, University
Jun 4, 2026
arXiv:2606.05774

LiAuto-GeoX: Efficient Grounded Driving Transformer

Jiawei Lian, Haoyi Sun, Yang Wu, Lifu Mu, Siyuan Wang, Le Hui, Ning Mao, Tao Wei, Pan Zhou, Kun Zhan, Jian Yang

AI Summary

This paper introduces LiAuto-GeoX, an efficient grounded driving transformer for real-time, onboard dense 3D reconstruction in autonomous driving. The model is first trained on large-scale surround-view data with sparse LiDAR priors for geometric grounding, then compressed to a compact 155M-parameter onboard model through a geometry-preserving distillation framework. Evaluations show that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, and that its learned geometry transfers to downstream trajectory and occupancy prediction, positioning dense 3D reconstruction as a foundational representation for future autonomous driving systems.

Key Contribution

Efficient dense 3D reconstruction can now serve as a scalable foundation for next-generation autonomous driving, achieving real-time performance without sacrificing fidelity.

Abstract

Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, using sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then distill this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation, which retains fine-grained metric structure by emphasizing geometrically informative regions, and relative-pose relational distillation, which enforces cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations show that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. Together, these results demonstrate that efficient dense 3D reconstruction can move beyond its traditional role as a perception target and serve as a scalable, foundational geometric representation for next-generation autonomous driving.
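The abstract names two distillation terms but does not give their formulas. As a rough illustration only, the sketch below shows one way such terms could look: a mask-weighted log-depth gap between student and teacher, and a pairwise relative-pose gap across views. Every name here (`masked_depth_distill_loss`, `relational_pose_loss`, `trans_weight`), the log-depth L1 penalty, and the rotation-angle-plus-translation distance are assumptions chosen for clarity, not the authors' method.

```python
import math

def masked_depth_distill_loss(student, teacher, mask, eps=1e-6):
    """Mask-weighted mean absolute log-depth gap (hypothetical loss form).

    student, teacher: flat lists of positive depths; mask: non-negative weights
    that emphasize the pixels assumed to be geometrically informative.
    """
    num = den = 0.0
    for s, t, w in zip(student, teacher, mask):
        num += w * abs(math.log(s + eps) - math.log(t + eps))
        den += w
    return num / max(den, eps)

def _transpose(m):
    return [list(row) for row in zip(*m)]

def _matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def _matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_pose(pose_i, pose_j):
    """Pose of view i expressed in the frame of view j.

    Each pose is (R, t), assumed camera-to-world, with R a 3x3 rotation.
    """
    (r_i, t_i), (r_j, t_j) = pose_i, pose_j
    r_j_inv = _transpose(r_j)  # inverse of a rotation is its transpose
    rot = _matmul(r_j_inv, r_i)
    trans = _matvec(r_j_inv, [a - b for a, b in zip(t_i, t_j)])
    return rot, trans

def rotation_angle(r_a, r_b):
    """Angle in radians between two rotations, via trace(r_a^T r_b)."""
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))

def relational_pose_loss(student_poses, teacher_poses, trans_weight=1.0):
    """Mean gap between student and teacher relative poses over all view pairs."""
    total, pairs = 0.0, 0
    n = len(student_poses)
    for i in range(n):
        for j in range(i + 1, n):
            rs, ts = relative_pose(student_poses[i], student_poses[j])
            rt, tt = relative_pose(teacher_poses[i], teacher_poses[j])
            total += rotation_angle(rs, rt) + trans_weight * math.dist(ts, tt)
            pairs += 1
    return total / max(pairs, 1)
```

Because the second term compares only relative poses, shifting every student pose by the same translation leaves it unchanged, which is one reason a relational formulation is a natural fit for cross-view consistency. In a real training setup these terms would be written with a tensor library and differentiated; the plain-Python form is only meant to make the arithmetic explicit.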

Computer Vision, Multimodal Models, Robotics & Embodied AI

