Tsinghua AI | May 29, 2026 | arXiv:2605.31584

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

AI Summary

LongTraceRL enhances long-context reasoning in LLMs by training on challenging contexts generated from search agent trajectories and rewarding correct answers based on a rubric that evaluates entity-level reasoning steps. This approach constructs tiered distractors, differentiating between highly confusable documents read but not cited and less confusable documents appearing in search results but never opened. By applying a positive-only rubric reward based on gold entities along the reasoning chain, LongTraceRL guides models to improve reasoning quality among correct responses and avoid reward hacking. Experiments across five long-context benchmarks show that LongTraceRL outperforms strong baselines, promoting more comprehensive and evidence-grounded reasoning in models ranging from 4B to 30B parameters.
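The tiering of distractors described above can be illustrated with a minimal sketch. It assumes a hypothetical `Trajectory` record with `retrieved`, `opened`, and `cited` fields; these names, and the `tier_documents` helper, are illustrative assumptions and are not taken from the LongTraceRL codebase.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: list[str]  # every document id that appeared in search results, in order
    opened: set[str]      # document ids the agent actually read
    cited: set[str]       # document ids the agent cited as evidence


def tier_documents(traj: Trajectory) -> dict[str, list[str]]:
    """Split retrieved documents into evidence and two distractor tiers."""
    tiers: dict[str, list[str]] = {
        "evidence": [],
        "high_confusability": [],  # read by the agent but not cited
        "low_confusability": [],   # shown in search results but never opened
    }
    seen: set[str] = set()
    for doc_id in traj.retrieved:
        if doc_id in seen:  # skip documents returned by more than one search
            continue
        seen.add(doc_id)
        if doc_id in traj.cited:
            tiers["evidence"].append(doc_id)
        elif doc_id in traj.opened:
            tiers["high_confusability"].append(doc_id)
        else:
            tiers["low_confusability"].append(doc_id)
    return tiers


example = Trajectory(
    retrieved=["a", "b", "c", "b", "d"],
    opened={"a", "b"},
    cited={"a"},
)
tiers = tier_documents(example)
```

In this sketch a training context would be assembled from the evidence documents plus distractors drawn from the two tiers; how many to draw from each tier is left open here because the summary does not specify it.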

Key Contribution

LLMs can be taught to reason more comprehensively over long contexts by rewarding not just the final answer, but also the quality of the reasoning steps taken to arrive at that answer.

Abstract

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
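The positive-only rubric reward can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: entity matching is done by case-insensitive substring search, the rubric score is the fraction of gold entities mentioned, and the `rubric_weight` parameter and the function name `rubric_reward` are hypothetical.

```python
def rubric_reward(
    response: str,
    answer_correct: bool,
    gold_entities: list[str],
    rubric_weight: float = 1.0,  # assumed scaling factor, not from the paper
) -> float:
    """Outcome reward plus an entity-coverage bonus, granted only to correct answers."""
    if not answer_correct:
        # Positive-only: incorrect responses earn nothing, so mentioning
        # gold entities cannot compensate for a wrong final answer.
        return 0.0
    if not gold_entities:
        return 1.0  # no rubric available; fall back to the outcome reward
    text = response.lower()
    # Assumed matching rule: an entity counts if it appears as a substring.
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    coverage = hits / len(gold_entities)
    return 1.0 + rubric_weight * coverage


chain = ["Marie Curie", "Warsaw", "Poland"]
partial = rubric_reward("Marie Curie was born in Warsaw.", True, chain)
wrong = rubric_reward("Marie Curie, Warsaw, Poland.", False, chain)
```

Because the bonus is added only when the final answer is correct, two correct responses are ranked by how much of the reasoning chain they surface, while a response that lists every entity but answers incorrectly still receives zero.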

Reasoning & Chain-of-Thought | Recommendation & Information Retrieval | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
