The paper introduces ContextRL, a reinforcement learning framework designed to improve the knowledge discovery efficiency of Multimodal Large Language Models (MLLMs) by addressing identifiability and reachability challenges in reward modeling. ContextRL enhances identifiability by providing the reward model with full reference solutions as context for fine-grained process verification and improves reachability through a multi-turn sampling strategy with mistake reports to guide policy recovery. Experiments across 11 perception and reasoning benchmarks demonstrate that ContextRL significantly boosts knowledge discovery efficiency, enabling the Qwen3-VL-8B model to match the performance of the 32B model and outperform standard RLVR baselines while mitigating reward hacking.
Context-augmented RL lets smaller MLLMs punch *way* above their weight, rivaling much larger models on reasoning tasks while dodging reward hacking.
We propose ContextRL, a novel framework that leverages context augmentation to overcome the identifiability and reachability bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but a low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy in which the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
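The two mechanisms described above can be illustrated with a rough Python sketch. This is not the authors' implementation. The names `Verdict`, `sample_group`, `policy`, and `judge` are hypothetical, and so are the group size, the turn limit, and the binary reward. The sketch assumes the reward model sees the reference solution and returns an answer check, a process check, and a mistake report, and that a group is resampled with those reports as a hint only when every sample in it failed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical output of a reward model that sees the reference solution."""
    answer_correct: bool
    process_correct: bool
    mistake_report: str = ""

    @property
    def reward(self) -> float:
        # Identifiability: credit a sample only if both the final answer and
        # the reasoning process pass, so right-answer/bad-reasoning false
        # positives receive no reward. The 0/1 scale is an assumption.
        return 1.0 if (self.answer_correct and self.process_correct) else 0.0


# Assumed interfaces: policy(prompt, hint) -> response,
# judge(response, reference_solution) -> Verdict.
Policy = Callable[[str, Optional[str]], str]
Judge = Callable[[str, str], Verdict]


def sample_group(
    prompt: str,
    reference: str,
    policy: Policy,
    judge: Judge,
    group_size: int = 4,
    max_turns: int = 2,
) -> Tuple[List[str], List[float], int]:
    """Sample a group; if all samples fail, retry with mistake reports as a hint."""
    responses: List[str] = []
    rewards: List[float] = []
    hint: Optional[str] = None
    for turn in range(max_turns):
        responses = [policy(prompt, hint) for _ in range(group_size)]
        verdicts = [judge(r, reference) for r in responses]
        rewards = [v.reward for v in verdicts]
        if any(r > 0 for r in rewards):
            # The group has at least one positive sample, so stop here.
            return responses, rewards, turn + 1
        # Reachability: the group is all-negative, so feed the mistake
        # reports back to the policy for another attempt.
        reports = [v.mistake_report for v in verdicts if v.mistake_report]
        hint = "\n".join(reports) or None
    return responses, rewards, max_turns


# Toy stand-ins for the policy and reward model, for illustration only.
def toy_policy(prompt: str, hint: Optional[str]) -> str:
    return "answer=4; steps=2+2" if hint else "answer=5; steps=guess"


def toy_judge(response: str, reference: str) -> Verdict:
    ok = response == reference
    return Verdict(ok, ok, "" if ok else "The sum was guessed; add the operands.")


responses, rewards, turns = sample_group(
    "What is 2+2?", "answer=4; steps=2+2", toy_policy, toy_judge
)
```

In this toy run the first group is all-negative, the mistake reports become the hint, and the second turn produces rewarded samples. How the recovered samples enter the policy update is left out, because the abstract does not specify it.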