The paper addresses the failure of Reinforcement Learning with Verifiable Rewards (RLVR) in long-context scenarios, where sparse answer-based rewards do not effectively guide models to identify relevant evidence. The authors formally prove that outcome-only rewards lead to vanishing gradients for context grounding. To mitigate this, they introduce LongRLVR, which augments the sparse answer reward with a dense, verifiable context reward that incentivizes selecting the correct grounding information. With Qwen and LLaMA models, LongRLVR significantly outperforms standard RLVR on long-context benchmarks, for example raising a 14B model's RULER-QA score from 73.17 to 88.90.
RLVR struggles with long contexts because the reward signal for *finding* the right information vanishes; directly rewarding the model for selecting relevant context restores that signal.
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding: the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model toward identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
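The idea of adding a dense context reward to a sparse answer reward can be illustrated with a minimal sketch. The abstract does not specify the exact form of either reward, so everything below is an assumption made for illustration: the exact-match answer check, the F1 overlap over chunk ids, the additive combination, and the names `answer_reward`, `context_reward`, `combined_reward`, and `weight` are all hypothetical and are not taken from the LongRLVR code.

```python
from typing import Iterable


def answer_reward(predicted: str, gold: str) -> float:
    """Sparse outcome reward (assumed form): 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def context_reward(selected: Iterable[int], gold: Iterable[int]) -> float:
    """Dense grounding reward (assumed form): F1 overlap between selected and gold chunk ids."""
    selected_set, gold_set = set(selected), set(gold)
    hits = len(selected_set & gold_set)
    if hits == 0:
        # Covers empty selections, empty gold sets, and disjoint sets.
        return 0.0
    precision = hits / len(selected_set)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def combined_reward(
    predicted: str,
    gold_answer: str,
    selected: Iterable[int],
    gold_chunks: Iterable[int],
    weight: float = 0.5,  # hypothetical mixing coefficient, not from the paper
) -> float:
    """Answer reward plus a weighted context reward (additive mix is an assumption)."""
    return answer_reward(predicted, gold_answer) + weight * context_reward(selected, gold_chunks)


if __name__ == "__main__":
    # Wrong answer, but the right evidence was selected: the reward is still nonzero.
    print(combined_reward("London", "Paris", [2, 3], [2, 3]))
    # Wrong answer and wrong evidence: no reward at all.
    print(combined_reward("London", "Paris", [7], [2, 3]))
```

Under these assumptions, a rollout that picks the right evidence but answers incorrectly still receives a partial reward, whereas an answer-only reward would score it the same as a rollout that found nothing relevant. That difference is the kind of learning signal the abstract describes the context reward as providing.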