Apr 17, 2026 · arXiv:2604.16060

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Sai Srinivas Kancheti, Aditya Kanade, Vineeth N. Balasubramanian, Tanuja Ganu

AI Summary

This paper evaluates how Chain-of-Thought (CoT) prompting affects visual spatial reasoning in multimodal reasoning models (MRMs), covering seventeen models across thirteen spatial benchmarks. The key finding is that CoT prompting consistently *degrades* performance on these tasks, in contrast to its benefits for mathematical and logical problem-solving. Through a "No-Image++" ablation, the authors further show that MRMs and CoT-prompted MLMs exhibit severe shortcut learning, hallucinating visual details from textual priors even when the image is absent.

Key Contribution

Chain-of-Thought prompting, a boon for logical reasoning in LLMs, surprisingly *harms* visual spatial reasoning in multimodal models.

Abstract

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
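The comparison the abstract describes (the same questions answered with and without CoT, plus an image-absent control) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the condition labels `direct`, `cot`, and `no_image`, the function names, and the toy records are hypothetical, and the `no_image` condition is a plain image-removed stand-in rather than the paper's No-Image++ protocol, which this page does not specify. The records are placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Aggregate accuracy per (benchmark, condition) pair.

    `records` is an iterable of (benchmark, condition, is_correct) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for benchmark, condition, is_correct in records:
        cell = totals[(benchmark, condition)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


def cot_delta(acc, benchmark):
    """CoT accuracy minus direct-answer accuracy; negative means CoT hurt."""
    return acc[(benchmark, "cot")] - acc[(benchmark, "direct")]


# Toy placeholder records (hypothetical, NOT data from the paper).
toy_records = [
    ("toy_spatial", "direct", True),
    ("toy_spatial", "direct", True),
    ("toy_spatial", "direct", False),
    ("toy_spatial", "direct", True),
    ("toy_spatial", "cot", True),
    ("toy_spatial", "cot", False),
    ("toy_spatial", "cot", False),
    ("toy_spatial", "cot", True),
    ("toy_spatial", "no_image", True),
    ("toy_spatial", "no_image", False),
    ("toy_spatial", "no_image", False),
    ("toy_spatial", "no_image", False),
]

acc = accuracy_by_condition(toy_records)
delta = cot_delta(acc, "toy_spatial")
print(f"direct={acc[('toy_spatial', 'direct')]:.2f} "
      f"cot={acc[('toy_spatial', 'cot')]:.2f} "
      f"no_image={acc[('toy_spatial', 'no_image')]:.2f} "
      f"delta={delta:+.2f}")
```

In a setup like this, a negative delta on a benchmark would correspond to the degradation the abstract reports, and image-absent accuracy well above chance would be one signal that answers rely on textual priors rather than on the image.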

Eval Frameworks & Benchmarks · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
