CAS | Feb 23, 2026 | arXiv:2602.19526

How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

Yinuo Xu, Shuo Lu, Jianjie Cheng, Qianlong Xie, Xingxing Wang, Ran He, Jian Liang

AI Summary

This paper studies how the prompt template, reward function, and policy optimization algorithm affect reinforcement learning for Deep Research agents built on Search-R1. It finds that a "Fast Thinking" prompt template is more stable and performs better than the "Slow Thinking" template; that F1-based rewards cause training collapse through answer avoidance unless action-level penalties are added; and that REINFORCE outperforms PPO while using fewer search actions, with GRPO the least stable of the three. Based on these findings, the authors introduce Search-R1++, a baseline that improves on Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and from 0.289 to 0.331 (Qwen2.5-3B).

Key Contribution

"Fast Thinking" prompts, F1 rewards with action-level penalties, and REINFORCE each improve RL training of research agents; building on these findings, Search-R1++ raises Search-R1's score from 0.403 to 0.442 on Qwen2.5-7B and from 0.289 to 0.331 on Qwen2.5-3B.

Abstract

Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
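To make the reward comparison in the abstract concrete, here is a minimal Python sketch of an exact-match (EM) reward, a token-level F1 reward, and an F1 reward with action-level penalties. The EM and F1 definitions follow the common SQuAD-style convention, which is an assumption about what the paper uses. The penalty function is purely illustrative: the abstract does not say which actions are penalized or by how much, so `penalized_f1_reward`, `no_answer_penalty`, and `no_search_penalty` are hypothetical names and values, not the authors' formulation.

```python
import re
import string
from collections import Counter
from typing import List, Optional


def normalize(text: str) -> List[str]:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, tokenize."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def em_reward(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_reward(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def penalized_f1_reward(
    prediction: Optional[str],
    gold: str,
    num_searches: int,
    no_answer_penalty: float = 0.5,  # hypothetical value, not from the paper
    no_search_penalty: float = 0.2,  # hypothetical value, not from the paper
) -> float:
    """Illustrative F1 reward with action-level penalties (assumed form).

    A rollout that never emits an answer (prediction is None) receives a
    negative reward, so avoiding the answer action is never the best choice.
    """
    if prediction is None:
        return -no_answer_penalty
    reward = f1_reward(prediction, gold)
    if num_searches == 0:
        reward -= no_search_penalty
    return reward
```

For example, `f1_reward("Eiffel Tower in Paris", "Eiffel Tower")` gives partial credit where `em_reward` gives none, which is the denser signal F1 offers. Under this sketch, a rollout with no answer scores below any answered rollout that searched at least once, which is one plausible way to discourage the answer avoidance the abstract describes.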

Recommendation & Information Retrieval | RLHF & Preference Learning | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 44

Year: 2026

Venue: N/A
