Mar 18, 2026arXiv:2603.18002

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys

AI Summary

Loc3R-VLM is introduced, a framework that enhances 2D Vision-Language Models (VLMs) with 3D understanding using monocular video input. It achieves this by incorporating two joint objectives: global layout reconstruction and explicit situation modeling, providing direct spatial supervision. Leveraging camera pose priors from a pre-trained 3D foundation model ensures geometric consistency, and the model achieves state-of-the-art performance on language-based localization and 3D question-answering benchmarks.

Key Contribution

Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.

Abstract

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Citation Metrics

Citations0

Influential citations0

References77

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Related Papers