The paper introduces SynPlanResearch-R1, a framework that synthesizes tool-use trajectories to encourage deeper exploration during the cold-start supervised fine-tuning of research agents. This approach addresses poor exploration behaviors, such as premature termination and biased tool usage, that limit the effectiveness of reinforcement learning with verifiable rewards (RLVR). By providing a strong initialization for RL, SynPlanResearch-R1 improves performance over state-of-the-art baselines by up to 6.0% on the Qwen3-8B backbone and 5.8% on the Qwen3-4B backbone across seven multi-hop and open-web benchmarks.
Jumpstart your research agent: synthetic tool-use plans overcome exploration bottlenecks and boost performance by up to 6% on multi-hop and open-web benchmarks.
Research agents enable models to gather information from the web using tools to answer user queries, which requires them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration and uses them to shape agent behavior during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance by up to 6.0% on the Qwen3-8B backbone and 5.8% on the Qwen3-4B backbone compared to SOTA baselines. Further analyses of tool-use patterns and training dynamics shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.