Tsinghua AICAS, China Academy of Space Technology, Kuaishou, Meituan, Vanderbilt
Feb 26, 2026
arXiv:2602.22623

ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL

Xingyu Lu, Jinpeng Wang, Yifan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan Fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Chun Yuan

AI Summary

The paper introduces ContextRL, a reinforcement learning framework designed to improve the knowledge discovery efficiency of Multimodal Large Language Models (MLLMs) by addressing identifiability and reachability challenges in reward modeling. ContextRL enhances identifiability by providing the reward model with full reference solutions as context for fine-grained process verification and improves reachability through a multi-turn sampling strategy with mistake reports to guide policy recovery. Experiments across 11 perception and reasoning benchmarks demonstrate that ContextRL significantly boosts knowledge discovery efficiency, enabling the Qwen3-VL-8B model to match the performance of the 32B model and outperform standard RLVR baselines while mitigating reward hacking.

Key Contribution

Context-augmented RL lets a smaller MLLM (Qwen3-VL-8B) reach performance comparable to the 32B model on perception and reasoning benchmarks while mitigating reward hacking.

Abstract

We propose ContextRL, a novel framework that leverages context augmentation to overcome two bottlenecks, Identifiability and Reachability. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but a low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy in which the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
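The two mechanisms in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Verdict`, `sample_group`, `toy_policy`, and `toy_reward_model`, the binary 0/1 reward, and the rule of retrying only when the whole group fails are assumptions made for illustration. The sketch shows a reward model that sees the reference solution and rejects right-answer, wrong-process samples, and a second sampling turn that feeds mistake reports back to the policy when a group is all-negative.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical reward-model output for one response."""
    answer_correct: bool   # final answer matches the reference
    process_sound: bool    # reasoning agrees with the reference solution
    mistake_report: str    # feedback for a failed attempt ("" if accepted)

    @property
    def accepted(self) -> bool:
        # A right answer with a flawed process is a false positive: reject it.
        return self.answer_correct and self.process_sound


# Assumed signatures: policy(question, mistake_report) -> response,
# reward_model(question, response, reference_solution) -> Verdict.
Policy = Callable[[str, Optional[str]], str]
RewardModel = Callable[[str, str, str], Verdict]


def sample_group(policy: Policy, reward_model: RewardModel, question: str,
                 reference_solution: str, group_size: int = 4,
                 max_turns: int = 2) -> Tuple[List[str], List[float]]:
    """Sample a group; if every response fails, retry with mistake reports."""
    feedback: List[Optional[str]] = [None] * group_size
    responses: List[str] = []
    rewards: List[float] = []
    for _ in range(max_turns):
        responses = [policy(question, fb) for fb in feedback]
        verdicts = [reward_model(question, r, reference_solution)
                    for r in responses]
        rewards = [1.0 if v.accepted else 0.0 for v in verdicts]
        if any(rewards):
            break  # the group is no longer all-negative
        feedback = [v.mistake_report for v in verdicts]
    return responses, rewards


def toy_policy(question: str, report: Optional[str]) -> str:
    # Stand-in for an MLLM: fails without feedback, succeeds with it.
    if report is None:
        return "guess: 5"
    return "because 3+4=7, answer: 7"


def toy_reward_model(question: str, response: str, reference: str) -> Verdict:
    # Stand-in verifier: checks the answer and looks for the reference step.
    correct = response.endswith("7")
    sound = "3+4" in response
    report = "" if (correct and sound) else "Show the addition 3+4 explicitly."
    return Verdict(correct, sound, report)
```

With the toy stand-ins, `sample_group(toy_policy, toy_reward_model, "3+4?", "3+4=7", group_size=2)` fails on the first turn and recovers on the second. How recovered responses enter the policy update is not specified in the abstract and is left out of the sketch.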

Reasoning & Chain-of-Thought, RLHF & Preference Learning, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
