BAIR, Applied Intuition | Feb 25, 2026 | arXiv:2602.22091

Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan

AI Summary

The paper introduces LFG, a teacher-guided framework for pretraining autonomous driving representations from unposed, unlabeled YouTube videos. LFG uses a feedforward architecture with a lightweight autoregressive module, trained on multi-modal pseudo-labels (point maps, camera poses, semantic segmentation, motion masks) that teacher models generate for both current and future frames. On the NAVSIM planning benchmark, the pretrained encoder surpasses multi-camera and LiDAR baselines using only a single monocular camera; it also performs strongly on semantic, geometric, and qualitative motion prediction tasks.

Key Contribution

A label-free pretraining method that learns driving representations directly from unposed in-the-wild YouTube videos and, with only a single monocular camera, surpasses multi-camera and LiDAR baselines on NAVSIM planning.

Abstract

Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.
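The abstract describes training against teacher pseudo-labels for four modalities over current and future frames, but it does not state the loss functions or their weights. The sketch below is one plausible way to combine such signals, not the authors' implementation: the L1, cross-entropy, and binary cross-entropy terms, the equal weights in `WEIGHTS`, and all function and field names are assumptions made for illustration, and each frame is reduced to a single pixel to keep the example small.

```python
import math

# Hypothetical per-modality weights; the abstract does not give any values.
WEIGHTS = {"points": 1.0, "pose": 1.0, "semantics": 1.0, "motion": 1.0}


def l1(pred, target):
    """Mean absolute error between two equal-length lists of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs, label):
    """Negative log-likelihood of the teacher's class under the predicted probabilities."""
    return -math.log(max(probs[label], 1e-12))


def binary_cross_entropy(prob, label):
    """Binary cross-entropy for one moving/static prediction (label is 0 or 1)."""
    prob = min(max(prob, 1e-12), 1.0 - 1e-12)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def frame_loss(student, teacher):
    """Weighted sum of the four modality losses for a single frame."""
    terms = {
        "points": l1(student["points"], teacher["points"]),
        "pose": l1(student["pose"], teacher["pose"]),
        "semantics": cross_entropy(student["semantics"], teacher["semantics"]),
        "motion": binary_cross_entropy(student["motion"], teacher["motion"]),
    }
    return sum(WEIGHTS[name] * value for name, value in terms.items())


def sequence_loss(student_frames, teacher_frames):
    """Average the frame loss over the current and future frames of a clip."""
    losses = [frame_loss(s, t) for s, t in zip(student_frames, teacher_frames)]
    return sum(losses) / len(losses)
```

Here `student_frames` and `teacher_frames` are lists of per-frame dictionaries in temporal order, so the same function covers both the current frame and the predicted future frames. In practice the relative weighting of the modalities would be a tuning decision that the abstract leaves unspecified.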

Computer Vision, Data Curation & Synthetic Data, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
