Mar 19, 2026 | arXiv:2603.19235

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu, Xian Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, X. Tan, Xiao Tan, Xiangyu Bai

AI Summary

The paper introduces VEGA-3D, a plug-and-play framework that uses the implicit 3D spatial prior learned by pre-trained video diffusion models to improve the geometric reasoning of MLLMs. VEGA-3D extracts spatiotemporal features from intermediate noise levels of the diffusion model and fuses them with semantic representations through a token-level adaptive gated fusion mechanism, without explicit 3D supervision. The authors report that VEGA-3D outperforms state-of-the-art baselines on 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks.

Key Contribution

MLLMs can gain 3D spatial reasoning abilities by tapping into the latent spatial knowledge already present in pre-trained video generation models, without explicit 3D supervision.

Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
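The abstract does not spell out the fusion mechanism, so the following is only a minimal sketch of what a token-level gated fusion could look like, not the authors' implementation. It assumes one scalar gate per token, computed by a sigmoid over a linear projection of the concatenated semantic and generative features. The names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical, and a real model would learn these parameters and operate on tensors rather than Python lists.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, generative, gate_weights, gate_bias=0.0):
    """Fuse two aligned token sequences with one scalar gate per token.

    semantic, generative: lists of tokens, each token a list of floats
        of the same dimension d (assumed already projected to a shared space).
    gate_weights: list of 2*d floats (hypothetical learned parameters).
    gate_bias: float (hypothetical learned parameter).

    Returns (fused_tokens, gates), where for each token
        fused = (1 - g) * semantic + g * generative.
    """
    if len(semantic) != len(generative):
        raise ValueError("token sequences must have the same length")

    fused, gates = [], []
    for sem_tok, gen_tok in zip(semantic, generative):
        if len(sem_tok) != len(gen_tok):
            raise ValueError("token dimensions must match")
        joint = list(sem_tok) + list(gen_tok)
        if len(gate_weights) != len(joint):
            raise ValueError("gate_weights must have length 2 * d")

        # The gate depends on both feature streams for this token.
        logit = sum(w * v for w, v in zip(gate_weights, joint)) + gate_bias
        g = sigmoid(logit)
        gates.append(g)

        # Convex combination of the two streams, weighted per token.
        fused.append([(1.0 - g) * s + g * z for s, z in zip(sem_tok, gen_tok)])
    return fused, gates


# Toy example: two tokens of dimension 2, with an untrained (all-zero) gate.
semantic_tokens = [[1.0, 0.0], [0.0, 1.0]]
generative_tokens = [[0.0, 2.0], [2.0, 0.0]]
fused_tokens, token_gates = gated_fusion(
    semantic_tokens, generative_tokens, gate_weights=[0.0] * 4
)
```

With all-zero gate weights every gate is 0.5, so each fused token is the plain average of its two inputs. A trained gate would instead shift individual tokens toward the generative features where geometric cues matter and toward the semantic features elsewhere.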

Computer Vision, Multimodal Models, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 93

Year: 2026

Venue: N/A
