Mar 8, 2026 · arXiv:2603.07751

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng

AI Summary

The paper introduces 3ViewSense, a framework that improves spatial reasoning in Vision-Language Models (VLMs) by grounding it in orthographic views. It targets the "spatial intelligence gap," in which VLMs struggle with basic spatial tasks despite strong language capabilities, using a "Simulate-and-Reason" mechanism that decomposes scenes into canonical orthographic projections. On spatial reasoning benchmarks, 3ViewSense significantly outperforms existing baselines, particularly on occlusion-heavy counting and view-consistent spatial reasoning, and it produces more stable and consistent spatial descriptions.

Key Contribution

VLMs can't count blocks because they lack a view-consistent spatial interface, but decomposing scenes into orthographic projections fixes it.

Abstract

Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.
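To make the idea of canonical orthographic projections concrete, the following is a toy sketch and not the paper's implementation. It assumes a scene made of unit blocks on an integer grid and projects it onto top, front, and side views, then derives bounds on the block count that any arrangement consistent with those three views must satisfy. The names `orthographic_views`, `count_bounds`, and `ThreeViews` are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

Block = Tuple[int, int, int]  # (x, y, z) coordinates of a unit block (assumed grid scene)
Cell = Tuple[int, int]


class ThreeViews(NamedTuple):
    top: FrozenSet[Cell]    # (x, y) cells visible when looking down the z axis
    front: FrozenSet[Cell]  # (x, z) cells visible when looking along the y axis
    side: FrozenSet[Cell]   # (y, z) cells visible when looking along the x axis


def orthographic_views(blocks: Iterable[Block]) -> ThreeViews:
    """Project a set of unit blocks onto the three axis-aligned planes."""
    blocks = list(blocks)
    return ThreeViews(
        top=frozenset((x, y) for x, y, z in blocks),
        front=frozenset((x, z) for x, y, z in blocks),
        side=frozenset((y, z) for x, y, z in blocks),
    )


def count_bounds(views: ThreeViews) -> Tuple[int, int]:
    """Return (lower, upper) bounds on the block count consistent with the views.

    lower: every occupied cell in a view needs at least one block behind it,
           so the largest view is a (not necessarily tight) lower bound.
    upper: a grid position can hold a block only if all three of its
           projections are occupied in the corresponding views.
    """
    lower = max(len(views.top), len(views.front), len(views.side))
    xs = {x for x, _ in views.top}
    ys = {y for _, y in views.top}
    zs = {z for _, z in views.front}
    upper = sum(
        1
        for x in xs
        for y in ys
        for z in zs
        if (x, y) in views.top and (x, z) in views.front and (y, z) in views.side
    )
    return lower, upper


# An L-shaped base of three blocks with one block stacked on the corner.
scene = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
views = orthographic_views(scene)
print(count_bounds(views))  # (3, 4)
```

In this example each individual view shows only three cells, while the three views taken together admit at most four blocks, which matches the true count. This illustrates, under the stated toy assumptions, why combining view-consistent projections can reduce the ambiguity that a single view leaves open.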

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
