Apr 20, 2026, arXiv:2604.18235

Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

Jiayi Wu, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Ming Gao, Xiang Li

AI Summary

This paper identifies and addresses two issues with Group Relative Policy Optimization (GRPO) in deep search: a mismatch between intermediate-step correctness and the reward signal, and training instability. The authors attribute these problems to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To mitigate them, they propose CalibAdv, an advantage calibration method that downscales excessive negative advantages based on intermediate-step correctness and rebalances positive and negative advantages in the answer component, improving performance and stability across three models and seven benchmarks.

Key Contribution

GRPO's Achilles' heel in deep search is its coarse advantage assignment, but CalibAdv offers a way to surgically correct it, boosting both performance and training stability.

Abstract

Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for deep search tasks. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then rebalances positive and negative advantages in the answer component. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
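The advantage mechanics the abstract describes can be illustrated with a minimal sketch. The first function is the usual group-relative normalization used in GRPO (population standard deviation is an assumption here; implementations differ). The second function, `downscale_negative`, and its `scale` parameter are hypothetical: the abstract does not give CalibAdv's actual formula, so this only shows the general idea of shrinking a negative advantage on steps judged correct. The answer-component rebalancing is not sketched, since the abstract gives no details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each rollout's reward within its group.

    Assumption: population standard deviation; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def downscale_negative(advantage, step_correct, scale=0.5):
    """Hypothetical calibration, NOT the paper's formula.

    Broadcasts one rollout-level advantage to every step, then shrinks it
    by `scale` on steps flagged correct when the advantage is negative.
    """
    return [
        advantage * scale if (advantage < 0 and ok) else advantage
        for ok in step_correct
    ]


# One group of four rollouts for the same question; only the first is correct.
rewards = [1.0, 0.0, 0.0, 0.0]
advs = group_relative_advantages(rewards)

# The second rollout failed overall, but its first two steps were judged correct.
step_correct = [True, True, False]
per_step = downscale_negative(advs[1], step_correct)
```

Without the calibration step, all three steps of the failed rollout would receive the same negative advantage, which is the coarse-grained assignment the abstract points to.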

Recommendation & Information Retrieval, RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
