NVIDIA, Beihang, HKU, UCSD, University of California | May 28, 2026 | arXiv:2605.30307

Grounded 3D-Aware Spatial Vision-Language Modeling

An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu, Guanqi Zhan, Zhuoyang Zhang, Zhaojing Yang, Song Han, Yao Lu, Pavlo Molchanov, Vidya Nariyambut Murali, Jan Kautz, Xiaolong Wang, Hongxu Yin, Sifei Liu

AI Summary

The paper introduces GR3D, a spatial vision-language model with explicit 2D, implicit 2D, and monocular 3D grounding capabilities. GR3D uses an implicit grounding mechanism to identify entity mentions and insert corresponding region tokens into the text stream, enabling reference to visual evidence during spatial chain-of-thought reasoning. The model also employs a region-prompted monocular 3D grounding design to predict 3D bounding boxes, achieving improved performance on spatial benchmarks and demonstrating grounding as a beneficial inductive bias.

Key Contribution

Grounding boosts spatial reasoning in VLMs: explicitly linking language to 2D and 3D scene elements lets models decompose complex spatial problems and improve performance even on non-grounded tasks.

Abstract

We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.
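As a rough illustration of the implicit grounding idea described in the abstract (region tokens placed into the text stream next to entity mentions), the toy sketch below post-processes a finished sentence. It is not the paper's implementation: the abstract says GR3D does this during generation, whereas here the function name `insert_region_tokens`, the `mention_to_region` lookup, and the `<region_k>` token format are all hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, List


def insert_region_tokens(tokens: List[str], mention_to_region: Dict[str, int]) -> List[str]:
    """Append a placeholder region token after each token that names a known entity.

    `mention_to_region` is a hypothetical lookup from an entity word to the
    index of a detected 2D region; a real model would predict this link itself.
    """
    out: List[str] = []
    for tok in tokens:
        out.append(tok)
        # Match on the lowercased word with trailing punctuation removed.
        key = tok.lower().strip(".,?!")
        if key in mention_to_region:
            # Hypothetical token format, not taken from the paper.
            out.append(f"<region_{mention_to_region[key]}>")
    return out


tokens = "The mug is left of the laptop.".split()
grounded = insert_region_tokens(tokens, {"mug": 0, "laptop": 1})
print(" ".join(grounded))
# prints: The mug <region_0> is left of the laptop. <region_1>
```

In this toy version each inserted token is only a string marker; in a model of the kind the abstract describes, such a position would presumably carry visual features of the referenced region so that later reasoning steps can attend to them.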

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0
Influential citations: 0
References: 124
Year: 2026
Venue: N/A
