This paper identifies two problems with Group Relative Policy Optimization (GRPO) in deep search: a mismatch between the correctness of intermediate steps and the reward signal, and training instability. The authors attribute both to coarse-grained advantage assignment and to an imbalance between positive and negative advantages. To mitigate them, they propose CalibAdv, an advantage calibration method that downscales excessive negative advantages based on intermediate-step correctness and rebalances positive and negative advantages in the answer component. The paper reports improved performance and training stability across three models and seven benchmarks.
GRPO's Achilles' heel in deep search is its coarse advantage assignment, but CalibAdv offers a way to surgically correct it, boosting both performance and training stability.
Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for deep search tasks. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then rebalances positive and negative advantages in the answer component. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
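To make the two ideas in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). The group-relative advantage follows the commonly used GRPO form, in which each rollout's reward is standardized within its group. The calibration functions `calibrate_step_advantages` and `rebalance_answer_advantages`, the `scale` parameter, and the rule of capping negative mass at positive mass are all hypothetical choices made for illustration. How step correctness is judged is not described in the abstract, so it is passed in as booleans.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (common GRPO form).

    Every token of a rollout normally shares this single value, which is
    the coarse-grained assignment the abstract refers to.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrate_step_advantages(advantage, step_correct, scale=0.2):
    """Hypothetical step-level calibration (an assumption, not the paper's rule).

    A negative rollout advantage is shrunk by `scale` on steps judged
    correct and left unchanged on steps judged incorrect. Non-negative
    advantages are passed through untouched.
    """
    if advantage >= 0:
        return [advantage for _ in step_correct]
    return [advantage * scale if ok else advantage for ok in step_correct]


def rebalance_answer_advantages(advantages):
    """Hypothetical rebalancing for the answer component (an assumption).

    If the total magnitude of negative advantages exceeds the total of
    positive ones, negatives are scaled down so the two masses match.
    """
    pos_mass = sum(a for a in advantages if a > 0)
    neg_mass = -sum(a for a in advantages if a < 0)
    if neg_mass == 0 or neg_mass <= pos_mass:
        return list(advantages)
    factor = pos_mass / neg_mass
    return [a * factor if a < 0 else a for a in advantages]


# One group of four rollouts; only the first reached the right answer.
group_adv = grpo_advantages([1.0, 0.0, 0.0, 0.0])

# Second rollout: wrong final answer, but steps 1 and 3 were judged correct.
per_step = calibrate_step_advantages(group_adv[1], [True, False, True])
```

In this toy group, the uncalibrated advantage would penalize all three steps of the second rollout equally. After the hypothetical calibration, only the step judged incorrect keeps the full negative advantage.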