LongRLVR yields consistent gains across all benchmarks ($73.17\!\rightarrow\!88.90$ on RULER-QA, $40.2\!\rightarrow\!46.5$ on LongBench v2, and $73.55\!\rightarrow\!78.42$ on LongReason). By successfully training models to ground their reasoning in the provided context, LongRLVR not only overcomes the limitations of conventional RLVR but also endows these models with long-context reasoning abilities comparable to, and even surpassing, state-of-the-art reasoning models such as the Qwen3 series (Qwen, 2025).

2 Method

In this section, we introduce LongRLVR to remedy the limitations of RLVR on long-context tasks. We first present an explicit grounding formulation for long-context RLVR in Section 2.1. Next, in Section 2.2, we formally prove that outcome-only rewards lead to a vanishing-gradient problem for this grounding process. To solve this, we introduce our verifiable context reward, presenting its theoretical foundation in Section 2.3.1 and a practical F-score-based implementation in Section 2.3.2. Finally, we detail the synthetic data generation pipeline that enables this approach in Section 2.4.

2.1 RLVR on Long Contexts: An Explicit Grounding Formulation

The standard RLVR framework optimizes a policy $\pi_{\theta}(y\mid X,Q)$ that generates an answer $y$ given a context $X$ and a question $Q$. The objective is to maximize the expected verifiable reward $r_{\text{ans}}(y)$, which typically evaluates the correctness of the final answer:

$$J_{\text{ans}}(\theta)=\mathbb{E}_{(X,Q)\sim\mathcal{D}}\left[\mathbb{E}_{y\sim\pi_{\theta}(y\mid X,Q)}\left[r_{\text{ans}}(y)\right]\right].\tag{1}$$

While effective for tasks where reasoning relies on parametric knowledge, this formulation conflates two distinct processes in long-context scenarios: (1) contextual grounding, the act of identifying the relevant subset of information within $X$, and (2) answer generation, the act of synthesizing an answer from the grounded information. When the context $X$ is extensive, the grounding process becomes non-trivial yet remains implicit within the monolithic policy $\pi_{\theta}(y\mid X,Q)$.

Here, we refactor the policy to model these two stages explicitly. Let the long context $X$ be segmented into a set of $N$ chunks, $C=\{c_{1},\dots,c_{N}\}$. The long-context policy should jointly perform grounding and answering, identifying a subset of selected chunks $Z\subseteq C$ and producing a final answer $y$. This process is modeled as a factorized distribution:

$$\pi_{\theta}(y,Z\mid X,Q)=\underbrace{\pi_{\theta}^{\text{gnd}}(Z\mid X,Q)}_{\text{Grounding Head}}\cdot\underbrace{\pi_{\theta}^{\text{ans}}(y\mid X,Q,Z)}_{\text{Answer Head}}.\tag{2}$$

The Grounding Head is responsible for contextual grounding, selecting the evidence $Z$ required to answer the question. The Answer Head then conditions on this selected evidence to produce the final output $y$.
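To make the factorization in Eq. (2) concrete, the following is a minimal Python sketch. It assumes an independent per-chunk Bernoulli parameterization of the grounding head (the paper's actual parameterization, Equation 9, is not reproduced in this excerpt); the class names `GroundingHead` and `AnswerHead` and the stub answer logic are hypothetical.

```python
# A minimal sketch of the factorized policy in Eq. (2). The independent
# Bernoulli selection and all names here are illustrative assumptions,
# not the paper's implementation.
import math
import random

class GroundingHead:
    """pi_gnd(Z | X, Q): selects evidence chunks via per-chunk logits s_j."""
    def __init__(self, logits):
        self.logits = logits  # one logit s_j per chunk c_j

    def sample(self):
        # z_j ~ Bernoulli(sigmoid(s_j)); p_j is the marginal selection prob.
        probs = [1.0 / (1.0 + math.exp(-s)) for s in self.logits]
        return [int(random.random() < p) for p in probs], probs

class AnswerHead:
    """pi_ans(y | X, Q, Z): answers conditioned on the selected chunks only."""
    def generate(self, question, chunks, selection):
        evidence = [c for c, z in zip(chunks, selection) if z]
        # Placeholder for an actual LLM call conditioned on the evidence.
        return f"answer({question!r}, evidence={evidence})"

# Usage: a 4-chunk context where chunks 2 and 4 carry the evidence.
chunks = ["c1", "c2 (evidence)", "c3", "c4 (evidence)"]
z, p = GroundingHead(logits=[-1.0, 2.0, -1.0, 2.0]).sample()
print(AnswerHead().generate("Q", chunks, z))
```

The point of the factorization is that $Z$ becomes an explicit, checkable object rather than an implicit side effect of generation, which is what later allows a verifiable reward to be attached to grounding directly.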
2.2 The Vanishing Grounding Gradient with Outcome-Only Rewards

We now formally analyze the learning dynamics of the factorized policy (Eq. 2) when it is optimized solely with the final answer reward $r_{\text{ans}}(y)$. We demonstrate that this outcome-only signal is insufficient for learning the grounding head $\pi_{\theta}^{\text{gnd}}$, creating a fundamental bottleneck for long-context reasoning.

Our analysis rests on a common property of long-context reasoning tasks: a correct solution often requires synthesizing a complete set of prerequisite evidence. Partial information, while helpful, typically yields a lower reward. That said, an LLM may occasionally answer correctly from a subset of $G$ or from alternative supporting evidence. This structure motivates the following formal assumption.

Assumption 1 (Sparse Answer Reward). Let $G\subseteq C$ be the ground-truth set of essential evidence chunks. There exists a non-negative, monotone set function $f:2^{G}\rightarrow\mathbb{R}_{\geq 0}$ with $f(\emptyset)=0$ such that the expected answer reward conditioned on the selected set $Z$ depends only on which ground-truth chunks are present:

$$\mathbb{E}[r_{\text{ans}}\mid Z]=\mu_{0}+f(Z\cap G),\tag{3}$$

where $\mu_{0}$ is a baseline reward from partial or spurious evidence. This form allows different chunks in $G$ to carry different importance and credits arbitrary subsets $Z\cap G$.

To analyze the gradient, we introduce a logit $s_{j}$ for each chunk $c_{j}\in C$ and denote by $z_{j}=\mathbf{1}\{c_{j}\in Z\}$ its selection indicator. Letting $p_{j}=\Pr_{\theta}(c_{j}\in Z)=\mathbb{E}_{\theta}[z_{j}]$ be the marginal selection probability under the grounding policy, we derive the proposition below.

Proposition 1 (Vanishing Gradients for Grounding). Under Assumption 1 and the grounding parameterization in Equation 9, the gradient of the expected answer reward with respect to the logit $s_{j}$ of any essential chunk $c_{j}\in G$ is

$$\nabla_{s_{j}}\mathbb{E}[r_{\text{ans}}]=\operatorname{Cov}\left(f(Z\cap G),\,z_{j}\right)=p_{j}(1-p_{j})\left(\mathbb{E}[f(Z\cap G)\mid z_{j}=1]-\mathbb{E}[f(Z\cap G)\mid z_{j}=0]\right).$$
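The practical consequence of Proposition 1 can be read off a special case. Below is a small numeric sketch, assuming independent per-chunk selection with a shared marginal $p$ and an all-or-nothing instance of $f$ ($f(S)=1$ iff $S=G$); both assumptions are our illustrative choices, not the paper's. Under them, $\mathbb{E}[f\mid z_{j}=1]=p^{|G|-1}$ and $\mathbb{E}[f\mid z_{j}=0]=0$, so the gradient is $p(1-p)\,p^{|G|-1}$.

```python
# Numeric check of Proposition 1 under an all-or-nothing special case of
# Assumption 1 (illustrative assumption): f(Z ∩ G) = 1 iff Z ⊇ G, with each
# chunk selected independently with marginal probability p. The grounding
# gradient p(1-p) * p^(|G|-1) decays exponentially in |G|: the outcome-only
# signal vanishes exactly when grounding is hardest (many required chunks,
# each rarely selected).

def grounding_gradient(p: float, g: int) -> float:
    """Gradient of E[r_ans] w.r.t. the logit s_j of an essential chunk."""
    return p * (1.0 - p) * p ** (g - 1)

for g in (1, 2, 4, 8, 16):
    print(f"|G|={g:2d}  grad={grounding_gradient(p=0.3, g=g):.2e}")
# |G|= 1  grad=2.10e-01
# |G|= 2  grad=6.30e-02
# |G|= 4  grad=5.67e-03
# |G|= 8  grad=4.59e-05
# |G|=16  grad=3.01e-09
```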