Westlake | Jun 2, 2026 | arXiv:2606.03577

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Huakang Geng, Duochao Shi, Wen-song Ye, Tao Lin, Chunhua Shen

AI Summary

This paper introduces ReasonMatch-Bench, a benchmark for evaluating multimodal large language models (MLLMs) on wide-baseline matching, a spatial reasoning task that requires geometric understanding, fine-grained perception, and occlusion reasoning under large viewpoint changes. The authors show that current MLLMs perform poorly on fine-grained correspondence: on a difficult 90-sample subset, the best existing baseline reaches 37.2 F1, compared with 84.0 F1 for human annotators. To address this gap, they build a scalable data-generation pipeline and propose Dynamic Correspondence Reinforcement Learning (DCRL), which improves MLLM performance on ReasonMatch-Bench and related spatial benchmarks while preserving general visual understanding capabilities.

Key Contribution

Current MLLMs struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset of ReasonMatch-Bench, the best existing baseline reaches 37.2 F1, compared with 84.0 F1 for human annotators.

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.
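The F1 scores quoted above can be made concrete with a small sketch. The abstract does not state how ReasonMatch-Bench computes F1, so everything below is an assumption made for illustration: the function name `correspondence_f1`, the pixel tolerance `tol_px`, and the rule that a predicted point counts as correct when it lands within that tolerance of the ground-truth location in the target view.

```python
from math import hypot


def correspondence_f1(predicted, ground_truth, tol_px=10.0):
    """Score predicted point correspondences against ground truth.

    Hypothetical scoring rule (not taken from the paper):
    both arguments map a source-point id to an (x, y) location in the
    target image. A prediction is a true positive when its id exists in
    the ground truth and it lies within tol_px pixels of the true point.
    Returns (precision, recall, f1).
    """
    true_positives = 0
    for point_id, (px, py) in predicted.items():
        if point_id not in ground_truth:
            continue  # predicted a match for a point that has none
        gx, gy = ground_truth[point_id]
        if hypot(px - gx, py - gy) <= tol_px:
            true_positives += 1

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three annotated points, two predictions, one within tolerance.
gt = {"a": (10.0, 10.0), "b": (50.0, 50.0), "c": (100.0, 100.0)}
pred = {"a": (12.0, 11.0), "b": (90.0, 90.0)}
precision, recall, f1 = correspondence_f1(pred, gt)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this assumed rule, a model is penalized both for wrong matches (lower precision) and for annotated points it fails to match (lower recall), which is one reason F1 is a natural summary for correspondence tasks.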

Topics: Eval Frameworks & Benchmarks | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 65

Year: 2026

Venue: N/A
