This paper introduces a data engine that leverages unlabeled internet videos to automatically generate training data for 3D scene understanding tasks. The authors identify key bottlenecks in automated data generation and demonstrate the effectiveness of their approach on tasks ranging from 3D object detection to 3D spatial VQA and Vision-Language Navigation (VLN). Models trained on the generated data exhibit strong zero-shot performance and improve further with fine-tuning, highlighting the potential of web data for training 3D scene understanding systems.
Unlock the potential of readily available internet videos to train 3D scene understanding models, achieving strong zero-shot performance and paving the way for more capable systems.
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data that facilitates end-to-end 3D scene understanding models alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after fine-tuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
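The abstract describes the data engine only at a high level, so as a rough mental model the sketch below shows what such a pipeline could look like in Python: ingest unlabeled web videos, attempt automatic 3D pseudo-labeling, and filter the results by a quality score before they are used for training. Every name here (`Pseudo3DSample`, `reconstruct_and_label`, `data_engine`, `min_confidence`) is a hypothetical stand-in, not the paper's actual API, and the labeling step is stubbed out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

@dataclass
class Pseudo3DSample:
    """One auto-generated training example (hypothetical schema)."""
    frames: List[str]    # paths to sampled video frames
    boxes_3d: List[list] # pseudo 3D bounding boxes, one per object
    confidence: float    # pseudo-label quality score in [0, 1]

def reconstruct_and_label(video_path: str) -> Optional[Pseudo3DSample]:
    """Hypothetical stand-in for the data engine's core step: a real
    implementation would run camera-pose estimation, 3D reconstruction,
    and automatic labeling on an unlabeled web video. The values
    returned here are placeholders."""
    # A real engine would reject unusable videos (e.g. no camera
    # parallax, heavy motion blur); we model that by returning None.
    return Pseudo3DSample(frames=[video_path], boxes_3d=[], confidence=0.5)

def data_engine(video_paths: Iterable[str],
                min_confidence: float = 0.7) -> Iterable[Pseudo3DSample]:
    """Yield pseudo-labeled samples, dropping low-quality ones.
    Quality filtering is one plausible form of the 'bottlenecks' the
    abstract alludes to: too strict a threshold starves the model of
    data, too loose a threshold floods it with noisy labels."""
    for path in video_paths:
        sample = reconstruct_and_label(path)
        if sample is not None and sample.confidence >= min_confidence:
            yield sample

if __name__ == "__main__":
    for s in data_engine(["web_video_000.mp4"], min_confidence=0.4):
        print(f"kept {s.frames[0]} (confidence={s.confidence:.2f})")
```

The generator design lets the filtered samples stream directly into a training loop or be mixed with human-annotated datasets, matching the abstract's framing of generated data as a complement to, not a replacement for, annotated data.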