Tsinghua AIGroup | Shenzhen University of Advanced | XJTU | May 25, 2026 | arXiv:2605.25524

ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong

AI Summary

This paper introduces ProSR, a process-shaping optimization framework to improve spatial reasoning in VLMs by explicitly addressing spurious grounding and tail instability during chain-of-thought reasoning. ProSR uses a Counterfactual Invariance Penalty to enforce visual dependence and a Tail Drift Penalty to promote trajectory stability. Experiments on spatial reasoning benchmarks demonstrate that ProSR enhances answer accuracy and generates more visually grounded and stable reasoning trajectories.

Key Contribution

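VLMs often fail at spatial reasoning because they either bypass visual evidence or become unstable late in the reasoning chain. ProSR is a process-shaping framework that penalizes both failure modes during optimization rather than rewarding answer correctness alone.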

Abstract

Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraints on the reasoning process, and therefore struggle to ensure genuine visual dependence and stable reasoning trajectories. In this paper, we construct a high-quality CoT dataset covering diverse spatial phenomena and diagnose the model's reasoning process, revealing two typical types of process degradation during reinforcement learning optimization: Spurious Grounding, which bypasses visual evidence, and Tail Instability, where uncertainty abnormally rises in the later stage of reasoning. To address these issues, we propose ProSR, a process-shaping optimization framework for spatial reasoning. Through a Counterfactual Invariance Penalty and a Tail Drift Penalty, ProSR extends the optimization objective from single answer correctness to two process-level dimensions: visual dependence and trajectory stability. Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while generating reasoning trajectories that are more stable and more dependent on visual evidence.
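The abstract names the two penalties but does not give their formulas. As a rough illustration only, the sketch below shows one way a correctness reward could be shaped along the two process-level dimensions. Everything in it is an assumption and not the paper's definition: the KL-with-margin form of the invariance penalty, the head-versus-tail entropy gap used for drift, and the names `tail_frac`, `margin`, `lam_cf`, and `lam_td`.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def counterfactual_invariance_penalty(p_orig, p_cf, margin=0.5):
    """Assumed form: penalize answers that barely change when the image
    is swapped for a counterfactual one (a sign of spurious grounding)."""
    return max(0.0, margin - kl_divergence(p_orig, p_cf))


def tail_drift_penalty(entropies, tail_frac=0.25):
    """Assumed form: how far mean per-token entropy in the last
    `tail_frac` of the trajectory rises above the mean of the rest."""
    n = len(entropies)
    k = max(1, int(n * tail_frac))
    if n <= k:
        return 0.0  # too short to split into head and tail
    head, tail = entropies[:-k], entropies[-k:]
    drift = sum(tail) / len(tail) - sum(head) / len(head)
    return max(0.0, drift)


def shaped_reward(correct, p_orig, p_cf, entropies, lam_cf=0.5, lam_td=0.5):
    """Answer correctness minus the two (hypothetical) process penalties."""
    outcome = 1.0 if correct else 0.0
    return (outcome
            - lam_cf * counterfactual_invariance_penalty(p_orig, p_cf)
            - lam_td * tail_drift_penalty(entropies))
```

Under these assumptions, a correct answer whose distribution is unchanged by the counterfactual image, or whose entropy climbs near the end of the chain, earns less than a correct answer that is visually dependent and stable. How the counterfactual image is built and how the per-token entropies are obtained are left open here.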

Computer Vision | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
