Melbourne | Feb 25, 2026 | arXiv:2602.21619

When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

AI Summary

This paper investigates the impact of injecting spatial cues, commonsense knowledge, and chain-of-thought (CoT) prompting into vision-language models (VLMs) for visual spatial reasoning (VSR). Through a systematic analysis across three VLMs and two benchmarks, the authors show that indiscriminate information injection can degrade performance. Specifically, targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge hurts accuracy, and CoT prompting helps only when spatial grounding is sufficiently precise. The authors conclude that selective, task-aligned information injection is key.

Key Contribution

Throwing more information at VLMs does not fix their spatial reasoning problems. Aggregating multiple spatial contexts, or injecting excessive or weakly relevant commonsense knowledge, can actively hurt performance.

Abstract

Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.
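To make the idea of selective injection concrete, the following is a minimal sketch of a prompt builder that caps the number of spatial cues, filters commonsense statements by relevance, and toggles a CoT instruction. It is an illustration only, not the authors' pipeline: the names `InjectionConfig` and `build_prompt`, the prompt wording, the default limits, and the relevance threshold are all assumptions, and the relevance scores are taken as given inputs from some external scorer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InjectionConfig:
    """Hypothetical knobs for the three factors named in the abstract."""
    max_spatial_cues: int = 1     # assumed default: keep a single targeted cue
    max_facts: int = 1            # assumed cap on injected commonsense statements
    min_relevance: float = 0.5    # assumed threshold; scores come from elsewhere
    use_cot: bool = False         # whether to append a chain-of-thought instruction


def build_prompt(
    question: str,
    spatial_cues: List[str],
    scored_facts: List[Tuple[str, float]],
    cfg: InjectionConfig,
) -> str:
    """Assemble a text prompt, injecting only a bounded amount of extra context."""
    # Keep at most `max_spatial_cues` cues (caller orders them by priority).
    cues = spatial_cues[: cfg.max_spatial_cues]

    # Drop weakly relevant facts, then keep the highest-scoring ones.
    relevant = [f for f in scored_facts if f[1] >= cfg.min_relevance]
    relevant.sort(key=lambda f: f[1], reverse=True)
    facts = [text for text, _ in relevant[: cfg.max_facts]]

    parts = []
    if cues:
        parts.append("Spatial context: " + " ".join(cues))
    if facts:
        parts.append("Background knowledge: " + " ".join(facts))
    parts.append("Question: " + question)
    if cfg.use_cot:
        parts.append("Let's think step by step.")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        "Is the cat under the table?",
        ["The cat's box lies below the table's box.", "The cat is left of the chair."],
        [("Cats often sit under furniture.", 0.9), ("Tables are made of wood.", 0.2)],
        InjectionConfig(use_cot=True),
    )
    print(prompt)
```

Under these assumptions, varying `max_spatial_cues`, `max_facts`, `min_relevance`, and `use_cot` independently gives one simple way to set up the kind of comparison the abstract describes (single cue versus aggregation, amount and relevance of commonsense, CoT on or off).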

Eval Frameworks & Benchmarks | Multimodal Models | Reasoning & Chain-of-Thought


Year: 2026

Venue: N/A
