University of Chineses Academy of Chineses

May 5, 2026

arXiv:2605.04018

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

Yilun Zhao, Jinbiao Wei, Tingyu Song, Siyue Zhang, Chen Zhao, Arman Cohan

AI Summary

The paper introduces BRIGHT-Pro, an expert-annotated benchmark for reasoning-intensive retrieval that expands each query with multi-aspect gold evidence and evaluates retrievers in both static and agentic search settings. The authors also construct RTriever-Synth, an aspect-decomposed synthetic training corpus that generates complementary positives and positive-conditioned hard negatives. RTriever-4B, LoRA fine-tuned from Qwen3-Embedding-4B on RTriever-Synth, substantially improves over its base model, and aspect-aware and agentic evaluation expose retriever behaviors that standard metrics hide.

Key Contribution

Standard retriever evaluations hide weaknesses that matter in agentic search systems. A new benchmark (BRIGHT-Pro) exposes these weaknesses, and a new synthetic training corpus (RTriever-Synth) helps address them.

Abstract

Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.
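To illustrate why aspect-aware evaluation can separate retrievers that a standard metric scores identically, here is a minimal sketch. It is not the paper's metric: the function `aspect_coverage_at_k`, the aspect names, and the document IDs are all hypothetical, and the sketch simply assumes that each query's gold evidence is grouped by aspect.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Standard recall@k: fraction of gold documents found in the top-k."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def aspect_coverage_at_k(
    ranked: List[str], aspects: Dict[str, Set[str]], k: int
) -> float:
    """Illustrative (assumed) metric: fraction of aspects that have
    at least one gold document in the top-k."""
    if not aspects:
        return 0.0
    top = set(ranked[:k])
    covered = sum(1 for docs in aspects.values() if top & docs)
    return covered / len(aspects)


# Hypothetical query whose gold evidence spans three aspects (toy data).
aspects = {
    "mechanism": {"d1", "d2"},
    "counterexample": {"d3"},
    "definition": {"d4"},
}
gold = set().union(*aspects.values())

run_a = ["d1", "d2", "x1", "x2"]  # two gold hits, both from one aspect
run_b = ["d1", "d3", "x1", "x2"]  # two gold hits, from two aspects

for name, run in [("run_a", run_a), ("run_b", run_b)]:
    print(name, recall_at_k(run, gold, 4), aspect_coverage_at_k(run, aspects, 4))
```

Both toy runs retrieve two of the four gold documents, so their recall@4 is the same, but `run_b` covers two aspects while `run_a` covers only one. This is the kind of difference in evidence complementarity that a flat gold set cannot register.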

Reasoning & Chain-of-Thought; Recommendation & Information Retrieval; Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 50

Year: 2026

Venue: N/A
