This paper introduces Perception-Reasoning Coevolution (PRCO), a dual-role reinforcement learning framework for multimodal large language models (MLLMs) that optimizes perception and reasoning with separate reward signals. PRCO uses an Observer to generate an evidence caption and a Solver to predict the final answer from that caption, with role-specific rewards: an outcome reward for the Solver and, for the Observer, a utility reward based on the Solver's downstream success. Experiments on eight multimodal reasoning benchmarks show that PRCO improves average accuracy by over 7 points compared to the base model and outperforms prior open-source RL-tuned baselines, supporting the case for disentangled optimization.
Disentangling perception and reasoning with role-specific rewards in multimodal LLMs boosts average accuracy by over 7 points relative to the base model, pointing to a perception bottleneck in existing shared-reward optimization approaches.
Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment: it frequently improves reasoning patterns while failing to reliably improve the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question, and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales, raising average accuracy by over 7 points compared to the base model and outperforming prior open-source RL-tuned baselines.
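As a rough illustration of the role-specific rewards described above, the Python sketch below scores one Observer caption and the Solver answers sampled from it. It is not the authors' implementation. The function names, the exact-match answer check, and the choice to average Solver outcomes over several sampled answers are all assumptions made for illustration, since the abstract does not say how the utility reward is computed.

```python
from dataclasses import dataclass
from typing import List


def outcome_reward(predicted: str, gold: str) -> float:
    """Verifiable outcome reward for the Solver.

    Assumption: a normalized exact-match check stands in for the verifier.
    """
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def observer_utility(solver_answers: List[str], gold: str) -> float:
    """Utility reward for the Observer.

    Assumption: utility is the mean Solver success rate over the answers
    sampled from a single evidence caption.
    """
    if not solver_answers:
        return 0.0
    total = sum(outcome_reward(answer, gold) for answer in solver_answers)
    return total / len(solver_answers)


@dataclass
class RoleRewards:
    observer: float      # one scalar for the caption
    solver: List[float]  # one scalar per sampled Solver answer


def role_rewards(solver_answers: List[str], gold: str) -> RoleRewards:
    """Compute separate rewards for the two roles from one caption's rollouts."""
    solver = [outcome_reward(answer, gold) for answer in solver_answers]
    return RoleRewards(
        observer=observer_utility(solver_answers, gold),
        solver=solver,
    )


if __name__ == "__main__":
    # Hypothetical rollouts: four Solver answers conditioned on one caption.
    rewards = role_rewards(["42", " 42 ", "7", "41"], gold="42")
    print(rewards)
```

In this sketch the Solver is credited per answer, while the Observer is credited only through how often its caption leads the Solver to a correct answer. This is one way to read the separation of credit between perception and reasoning that the abstract describes.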