Apr 8, 2026 · arXiv:2604.06725

Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

AI Summary

This paper introduces a training-free framework for enhancing Multimodal Large Language Model (MLLM) spatial reasoning by integrating explicit 3D scene reconstruction and active view synthesis. The framework reconstructs a 3D mesh from a single image using MLLM-guided keyword extraction and mask generation, then iteratively computes optimal camera viewpoints to synthesize novel views. Experiments show the approach significantly improves spatial comprehension, outperforming specialized spatial models and advanced MLLMs like GPT-5.2 and Gemini-2.5-Flash on benchmarks such as 3DSRBench and Rel3D.

Key Contribution

Forget expensive post-training or rigid tool-calling: this training-free framework lets MLLMs achieve state-of-the-art 3D spatial reasoning simply by actively exploring reconstructed 3D scenes.

Abstract

Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to their reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.

Multimodal Models · Reasoning & Chain-of-Thought · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 65

Year: 2026

Venue: N/A
