Tencent AI · Feb 26, 2026 · arXiv:2602.22576

Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Tianle Xia, Mingde Xu, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang

AI Summary

The paper introduces Search-P1, a novel reward shaping framework for training agentic RAG systems that addresses the limitations of sparse rewards and low sample efficiency in existing RL-based methods. Search-P1 uses a path-centric reward that evaluates reasoning trajectories based on step coverage and soft scoring, extracting learning signals even from unsuccessful attempts. By incorporating dual-track path scoring with offline-generated reference planners, Search-P1 assesses path quality from both self-consistency and reference-alignment perspectives, leading to improved performance on QA benchmarks.

Key Contribution

Search-P1's path-centric reward shaping, which extracts learning signals even from failed reasoning attempts, gives agentic RAG an average accuracy gain of 7.7 points over Search-R1 and other strong baselines on multiple QA benchmarks.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
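The abstract does not give the reward formula, so the following is only a minimal sketch of how order-agnostic step coverage, soft scoring, and dual-track scoring could fit together. Everything in it is an assumption made for illustration: the function names `step_coverage` and `path_reward`, the exact-match comparison of normalized step strings, taking the maximum of the two tracks, and the `path_weight` blending parameter are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def step_coverage(executed: Sequence[str], planned: Sequence[str]) -> float:
    """Fraction of planned steps found anywhere in the executed steps.

    Order is ignored. Matching is exact after lowercasing and trimming,
    which is an illustrative assumption, not the paper's matching rule.
    """
    planned_set = {s.strip().lower() for s in planned}
    if not planned_set:
        return 0.0
    executed_set = {s.strip().lower() for s in executed}
    return len(planned_set & executed_set) / len(planned_set)


def path_reward(
    executed: Sequence[str],
    self_plan: Sequence[str],
    reference_plan: Sequence[str],
    answer_correct: bool,
    path_weight: float = 0.5,  # hypothetical blending weight
) -> float:
    """Blend an outcome reward with a dual-track path score.

    Track 1 (self-consistency): coverage of the model's own stated plan.
    Track 2 (reference alignment): coverage of an offline reference plan.
    Taking the max of the two tracks is an assumption of this sketch.
    """
    self_score = step_coverage(executed, self_plan)
    ref_score = step_coverage(executed, reference_plan)
    path_score = max(self_score, ref_score)
    outcome = 1.0 if answer_correct else 0.0
    # Soft scoring: a wrong answer still earns credit for a partly good path.
    return (1.0 - path_weight) * outcome + path_weight * path_score


if __name__ == "__main__":
    executed = ["search capital of France"]
    self_plan = ["search capital of France", "search mayor of Paris"]
    reference_plan = ["search population of Paris", "search capital of France"]
    # A failed sample still receives a nonzero reward from its partial path.
    print(path_reward(executed, self_plan, reference_plan, answer_correct=False))
```

Under these assumptions, a trajectory with a wrong final answer that still covers half of the planned steps receives a small positive reward instead of zero, which is the kind of signal from failed samples that the abstract describes.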

Recommendation & Information Retrieval · Tool Use & Agents · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
