SJTU, Tongji | Jun 10, 2026 | arXiv:2606.11683

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Xiaofeng Cao, Jiangchao Yao

AI Summary

This paper introduces Reason, then Re-reason (ReRe), a training-free, inference-time framework that improves spatial reasoning in egocentric videos by letting a model revisit and revise its hypothesis when a new viewpoint becomes available. The process has two phases: the model first forms a spatial hypothesis from the original video, then verifies or revises it against a synthesized novel-view video. This addresses a limitation of single-turn inference, which forces models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. Evaluations on VSI-Bench and STI-Bench show that ReRe substantially improves open-source MLLMs, bringing them close to proprietary models on spatial reasoning tasks.

Key Contribution

Making spatial reasoning revisitable lets a model verify or revise its initial hypothesis against a synthesized complementary viewpoint, which improves the accuracy of open-source MLLMs on VSI-Bench and STI-Bench.

Abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
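The two-phase flow described in the abstract can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the names `rere`, `ask_mllm`, `render_novel_view`, and `ReReResult` are hypothetical, the verification prompt wording is invented for the example, and showing only the novel-view video in the second phase is an assumption. The MLLM and the Geometry-to-Video renderer are passed in as plain callables and replaced by stubs in the demo.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Placeholder type: a video is modeled as a sequence of frame identifiers.
Video = Sequence[str]


@dataclass
class ReReResult:
    hypothesis: str  # answer formed in the Reason Phase
    final: str       # answer after the Re-reason Phase


def rere(
    video: Video,
    question: str,
    ask_mllm: Callable[[Video, str], str],
    render_novel_view: Callable[[Video], Video],
) -> ReReResult:
    """Hypothetical sketch of a reason-then-re-reason loop."""
    # Reason Phase: form a spatial hypothesis from the original video.
    hypothesis = ask_mllm(video, question)

    # Stand-in for the Geometry-to-Video pipeline: produce a
    # complementary novel-view video from the original one.
    novel_view = render_novel_view(video)

    # Re-reason Phase: verify or revise the hypothesis on the new view.
    # The prompt text below is an assumption made for illustration.
    prompt = (
        f"{question}\n"
        f"Initial hypothesis: {hypothesis}\n"
        "Verify or revise this hypothesis using the new viewpoint."
    )
    final = ask_mllm(novel_view, prompt)
    return ReReResult(hypothesis=hypothesis, final=final)


# Stubs standing in for a real MLLM and renderer.
def fake_mllm(video: Video, prompt: str) -> str:
    # Answers differently depending on which view it is shown.
    return "left" if video[0].startswith("ego") else "right"


def fake_renderer(video: Video) -> Video:
    return [f"novel_{i}" for i in range(len(video))]


result = rere(
    ["ego_0", "ego_1"],
    "Is the chair to the left or right of the table?",
    fake_mllm,
    fake_renderer,
)
print(result)
```

Because both phases call the same `ask_mllm` interface with a video and a text prompt, the sketch mirrors the abstract's point that the novel view is consumed through the MLLM's native video interface, with no architectural changes.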

Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
