Meta AI | Apr 9, 2026 | arXiv:2604.07990

SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Yunnan Wang, Kecheng Zheng, Jianyuan Wang, Minghao Chen, David Novotný, Christian Rupprecht, Yinghao Xu, Xing Zhu, Wenjun Zeng, Xin Jin, Yujun Shen

AI Summary

SceneScribe-1M is introduced as a large-scale video dataset comprising one million in-the-wild videos, each annotated with textual descriptions, camera parameters, dense depth maps, and 3D point tracks. This dataset aims to bridge the gap between 3D understanding and video generation by providing a unified resource with rich semantic and spatio-temporal information. Benchmarks are established across tasks like monocular depth estimation, scene reconstruction, dynamic point tracking, and text-to-video synthesis to demonstrate the dataset's versatility.

Key Contribution

A million videos with paired depth, camera pose, and 3D point tracks could unlock a new wave of 3D-aware video models.

Abstract

The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

Computer Vision, Data Curation & Synthetic Data, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 71

Year: 2026

Venue: N/A
