The paper introduces SIGHT, a reinforcement learning framework designed to improve search-based reasoning in LLMs by mitigating redundancy and noise in search results. SIGHT uses Self-Evidence Support (SES) to distill search results into high-fidelity evidence, and employs an Information Gain score to identify pivotal states that trigger Dynamic Prompting Interventions such as de-duplication and adaptive branching. By integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT achieves superior performance on single-hop and multi-hop QA benchmarks while using fewer search steps than existing methods.
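The Information Gain score described above can be read as an entropy-reduction measure: a retrieval is "pivotal" when it sharply reduces the model's uncertainty over candidate answers. The sketch below illustrates that idea only; the belief distributions, the threshold value, and the intervention names are illustrative assumptions, not the paper's actual formulation.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical belief over candidate answers."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def information_gain(belief_before, belief_after):
    """Entropy reduction produced by a new search observation."""
    return entropy(belief_before) - entropy(belief_after)

# Hypothetical beliefs over three candidate answers, before and after
# one retrieval step (values chosen for illustration).
before = [0.4, 0.35, 0.25]
after = [0.8, 0.15, 0.05]

gain = information_gain(before, after)

# Assumed cutoff: a high-gain state is "pivotal" and triggers an
# intervention such as spawning a new reasoning branch.
THRESHOLD = 0.3
intervention = "branch" if gain >= THRESHOLD else "continue"
```

Under this toy belief update the gain is roughly 0.47 nats, so the state would be flagged as pivotal.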
LLMs can overcome "tunnel vision" in multi-turn search scenarios by using information gain to guide dynamic prompting interventions, leading to more efficient and accurate reasoning.
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, this interaction, particularly in multi-turn search scenarios, introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions (including de-duplication, reflection, and adaptive branching) to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, while using fewer search steps.
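Group Relative Policy Optimization, mentioned in the abstract, replaces a learned critic by normalizing each rollout's reward against its sampling group. A minimal sketch of that normalization, with an assumed additive combination of correctness and SES rewards (the weighting coefficient `alpha` and the reward values are hypothetical, not taken from the paper):

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each rollout's reward by the
    group mean and standard deviation, so no value critic is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Hypothetical per-rollout rewards for one question: binary answer
# correctness plus an SES evidence-fidelity bonus.
correctness = [1.0, 0.0, 1.0, 0.0]
ses_bonus = [0.8, 0.2, 0.5, 0.1]
alpha = 0.5  # assumed weight on the SES term

rewards = [c + alpha * s for c, s in zip(correctness, ses_bonus)]
advantages = group_relative_advantages(rewards)
```

Rollouts scoring above the group mean receive positive advantages and are reinforced; those below are discouraged, which is how the combined SES-plus-correctness signal shapes exploration without an external verifier.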