Beijing Key Laboratory of Intelligent | BUPT | Jun 1, 2026 | arXiv:2606.02459

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

AI Summary

This paper introduces an agentic pipeline for spatial reasoning in Vision-Language Models (VLMs) built on a dynamic cognitive map and Spatial Assertion Codes (SAC). The method treats the VLM as an active agent rather than a passive observer, and it addresses a limitation of existing reinforcement learning approaches, which rely on sparse rewards. On the MindCube benchmark, the method reaches an overall accuracy of 80.5%. On the challenging Rotation subset, it surpasses the previous best method by 29.5 accuracy points, a relative improvement of 53.2%.

Key Contribution

Turning VLMs into active agents equipped with a dynamic cognitive map yields a 53.2% relative improvement (29.5 accuracy points) over the previous best method on the Rotation subset of MindCube.
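The abstract reports the 53.2% figure as a relative improvement on the Rotation subset, alongside a 29.5-point absolute gain. The short sketch below shows how the two figures relate. The function name `implied_scores` is hypothetical, and the baseline and new scores it prints are only arithmetic consequences of the two rounded figures, not numbers reported on this page.

```python
def implied_scores(abs_gain_points, rel_gain_percent):
    """Recover approximate before/after accuracies from an absolute gain
    (in accuracy points) and a relative gain (in percent).

    relative gain = absolute gain / baseline, so
    baseline      = absolute gain / relative gain.
    """
    baseline = abs_gain_points / (rel_gain_percent / 100.0)
    return baseline, baseline + abs_gain_points


# Figures quoted in the abstract for the Rotation subset.
baseline, improved = implied_scores(29.5, 53.2)

# Approximate values only: both inputs are rounded in the source.
print(round(baseline, 1), round(improved, 1))
```

Read this way, the 53.2% describes the gain relative to the previous method's Rotation accuracy, not a 53.2-point jump in accuracy.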

Abstract

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which limits their use in real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by how pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new dynamic cognitive map that parameterizes the scene layout as object positions and orientations and serves as persistent memory for new observations. Second, we propose Spatial Assertion Codes (SAC): Python expressions that programmatically describe spatial relationships. Working together with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, and on the challenging Rotation subset our method outperforms the best current method by 29.5 accuracy points (a relative improvement of 53.2%). Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
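To make the two ideas in the abstract concrete, here is a minimal sketch of how a cognitive map and Python assertion expressions could interact. It is not the authors' implementation: the class names, the predicates `left_of`, `right_of`, and `in_front_of`, the 2D coordinate convention, and the reward defined as the fraction of assertions that hold are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectState:
    x: float
    y: float
    heading_deg: float = 0.0  # assumed convention: 0 faces +y, clockwise positive


@dataclass
class CognitiveMap:
    """Hypothetical persistent memory: object name -> position and orientation."""
    objects: dict = field(default_factory=dict)

    def update(self, name, x, y, heading_deg=0.0):
        # A new observation overwrites the stored state for that object.
        self.objects[name] = ObjectState(x, y, heading_deg)


def _to_local(viewer, target):
    """Return (right, forward) offsets of target in the viewer's frame."""
    dx, dy = target.x - viewer.x, target.y - viewer.y
    t = math.radians(viewer.heading_deg)
    forward = dx * math.sin(t) + dy * math.cos(t)
    right = dx * math.cos(t) - dy * math.sin(t)
    return right, forward


def evaluate_assertions(cmap, assertions):
    """Evaluate each assertion string against the map; errors count as False."""
    obj = cmap.objects
    env = {
        "left_of": lambda v, t: _to_local(obj[v], obj[t])[0] < 0,
        "right_of": lambda v, t: _to_local(obj[v], obj[t])[0] > 0,
        "in_front_of": lambda v, t: _to_local(obj[v], obj[t])[1] > 0,
    }
    results = []
    for expr in assertions:
        try:
            # Builtins are disabled for tidiness; this is not a secure sandbox.
            results.append(bool(eval(expr, {"__builtins__": {}}, env)))
        except Exception:
            results.append(False)
    return results


def dense_reward(results):
    """Assumed reward: fraction of intermediate assertions that hold."""
    return sum(results) / len(results) if results else 0.0


cmap = CognitiveMap()
cmap.update("agent", 0.0, 0.0, heading_deg=0.0)
cmap.update("chair", -1.0, 2.0)
cmap.update("table", 1.0, 2.0)

steps = [
    "left_of('agent', 'chair')",
    "right_of('agent', 'table')",
    "in_front_of('agent', 'chair')",
    "left_of('agent', 'table')",  # deliberately wrong step
]
print(dense_reward(evaluate_assertions(cmap, steps)))  # 0.75
```

Because each intermediate step is checked on its own, a partially correct reasoning chain still earns partial credit, which is the intuition behind a dense reward as opposed to a single sparse reward on the final answer. If the agent turns around (`cmap.update("agent", 0.0, 0.0, heading_deg=180.0)`), the same assertions are re-evaluated against the updated map and left and right swap.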

Multimodal Models | Reasoning & Chain-of-Thought | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
