This paper investigates the impact of Chain-of-Thought (CoT) prompting on the visual spatial reasoning capabilities of multimodal reasoning models (MRMs), evaluating seventeen models across thirteen spatial benchmarks. The key finding is that CoT prompting consistently *degrades* performance on visual spatial reasoning tasks, contrary to its benefits in other domains. Through a "No-Image++" ablation study, the authors further show that MRMs and CoT-prompted MLMs exhibit shortcut learning: they hallucinate visual details from textual priors even when no image is provided.
Chain-of-Thought prompting, a boon for logical reasoning in LLMs, surprisingly *harms* visual spatial reasoning in multimodal models.
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
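The comparison described above, accuracy with CoT prompting versus accuracy without it on each benchmark, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The record layout, the condition labels `"direct"` and `"cot"`, and the function names `accuracy_by_condition` and `cot_delta` are all assumptions introduced here for illustration.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Tally accuracy per (benchmark, condition) pair.

    Each record is a dict with the hypothetical keys
    'benchmark' (str), 'condition' (str), and 'correct' (bool).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["benchmark"], r["condition"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}


def cot_delta(acc, baseline="direct", treatment="cot"):
    """Return treatment accuracy minus baseline accuracy per benchmark.

    A negative value means the treatment condition scored lower than
    the baseline on that benchmark. Benchmarks missing either
    condition are skipped.
    """
    benchmarks = {bench for bench, _ in acc}
    return {
        bench: acc[(bench, treatment)] - acc[(bench, baseline)]
        for bench in sorted(benchmarks)
        if (bench, treatment) in acc and (bench, baseline) in acc
    }
```

Given per-question results for one model, `cot_delta(accuracy_by_condition(records))` would yield one signed number per benchmark, so a consistent degradation would appear as negative values across the board. The same tallying should work for any other condition label, for instance a run in which the image is withheld, although how the No-Image++ ablation is actually constructed is not specified in this summary.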