SJTUMar 8, 2026arXiv:2603.07660

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Yuanyuan Gao, Hao Li, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong

AI Summary

Holi-Spatial introduces a fully automated pipeline for constructing large-scale, spatially-aware multimodal datasets from raw video inputs, addressing the limitations of existing manually annotated datasets. The pipeline generates geometrically accurate 3D Gaussian Splatting reconstructions with rendered depth maps, object-level semantic annotations, and spatial Question-Answer pairs. The resulting Holi-Spatial-4M dataset, containing 12K optimized 3DGS scenes and millions of annotations, significantly outperforms existing methods in data curation quality and improves VLM performance on spatial reasoning tasks.

Key Contribution

Forget hand-annotated 3D datasets: a new automated pipeline generates massive, high-quality 3D spatial intelligence from raw video, unlocking better VLM reasoning.

Abstract

The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention, using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial Question-Answer (QA) pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset has also led to substantial improvements in model performance.

Computer Vision Data Curation & Synthetic Data Multimodal Models

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Related Papers