Apr 16, 2026 · arXiv:2604.15148

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

AI Summary

IG-Search is a reinforcement learning framework that uses a step-level Information Gain (IG) reward to credit effective search queries in search-augmented reasoning. IG measures how much the retrieved documents raise the model's confidence in the gold answer relative to a counterfactual baseline of random documents, giving a fine-grained signal for credit assignment. Experiments on seven single-hop and multi-hop question-answering benchmarks show that IG-Search outperforms trajectory-level and other step-level methods, especially on multi-hop reasoning tasks, while adding only about 6.4% to per-step training wall-clock time and leaving inference latency unchanged.

Key Contribution

Reward your LLM's search queries like a discerning librarian: IG-Search uses information gain to give step-by-step feedback, boosting multi-hop reasoning without requiring intermediate annotations beyond standard question-answer pairs.

Abstract

Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model's confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
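The step-level reward and its use in GRPO can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: IG is taken as a difference of gold-answer log-probabilities, the random-document baseline is averaged over several draws, and the IG term is added to the trajectory-level advantage on the search-query tokens with a hypothetical `weight`. The function names are likewise hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def information_gain(logp_gold_retrieved, logp_gold_random):
    """IG for one search step (assumed form).

    logp_gold_retrieved: log-probability of the gold answer given the
        context plus the documents actually retrieved by the query.
    logp_gold_random: log-probabilities of the gold answer given the
        context plus random documents (counterfactual baseline).
    """
    return logp_gold_retrieved - mean(logp_gold_random)


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized trajectory-level advantages, as in standard GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def modulate(traj_advantage, num_tokens, query_spans, step_igs, weight=0.5):
    """Per-token advantages for one rollout (assumed additive modulation).

    Every token starts from the trajectory-level advantage; tokens inside
    each search-query span [start, end) also receive weight * IG of that step.
    """
    adv = [traj_advantage] * num_tokens
    for (start, end), ig in zip(query_spans, step_igs):
        for t in range(start, end):
            adv[t] += weight * ig
    return adv


# A rollout group in which every trajectory answers incorrectly:
# the trajectory-level advantages are all zero.
base = grpo_advantages([0.0, 0.0, 0.0, 0.0])

# One search step whose retrieved documents help more than random ones.
ig = information_gain(-1.25, [-3.0, -2.5])

# The query tokens (positions 1 and 2) still receive a nonzero advantage.
per_token = modulate(base[0], num_tokens=6, query_spans=[(1, 3)], step_igs=[ig])
```

In this toy case the group-normalized advantages vanish because all rewards are equal, yet the query tokens keep a positive advantage from the IG term. That is the situation the abstract describes as retaining a meaningful gradient signal when every sampled trajectory fails.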

Reasoning & Chain-of-Thought; Recommendation & Information Retrieval; RLHF & Preference Learning; Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
