BJTU | Kuaishou | Mar 10, 2026 | arXiv:2603.09203

Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

AI Summary

The paper introduces Evaluate-as-Action (EvalAct), an approach for training retrieval-augmented agents that turns the assessment of retrieval quality into an explicit action. It enforces a Search-to-Evaluate protocol in which each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To use these signals, the authors propose Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to the evaluation scores. On seven open-domain QA benchmarks, EvalAct achieves the best average accuracy, with the largest gains on multi-hop tasks.
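The Search-to-Evaluate protocol described above can be illustrated with a small trajectory check. This is a minimal sketch under stated assumptions, not the paper's implementation: the action encoding (`"search"`, `"evaluate"`, `"answer"` tuples) and the function name `follows_search_to_evaluate` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical action encoding (an assumption, not from the paper):
# ("search", query), ("evaluate", score), ("answer", text)
Action = Tuple[str, object]


def follows_search_to_evaluate(trajectory: List[Action]) -> bool:
    """Check that each search is immediately followed by an evaluation,
    and that no evaluation appears without a search right before it."""
    for i, (kind, _) in enumerate(trajectory):
        if kind == "search":
            # A search must not be the last action, and the next
            # action must be the coupled evaluation.
            if i + 1 >= len(trajectory) or trajectory[i + 1][0] != "evaluate":
                return False
        elif kind == "evaluate":
            # An evaluation is only valid directly after a search.
            if i == 0 or trajectory[i - 1][0] != "search":
                return False
    return True
```

A trajectory such as search, evaluate, search, evaluate, answer passes the check, while one that answers directly after a search does not.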

Key Contribution

Having a retrieval-augmented agent explicitly score the quality of each retrieval step provides finer-grained training signals than outcome-only rewards and yields the best average accuracy across seven open-domain QA benchmarks, with the largest gains on multi-hop question answering.

Abstract

Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose EvalAct (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that EvalAct achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
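The idea of segment-level advantage rescaling can be sketched as follows. The abstract does not give PCAR's formula, so everything specific here is an assumption made for illustration: evaluation scores are taken to lie in [0, 1], the linear scaling rule with a `floor` parameter is a stand-in rather than the paper's rule, and the function names are hypothetical. Only the group-normalized advantage follows the standard GRPO form.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: normalize each rollout's outcome reward
    by the mean and standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rescale_segments(advantage: float, scores: List[float],
                     floor: float = 0.5) -> List[float]:
    """Assumed rescaling rule (NOT the paper's formula): one advantage per
    segment, scaled linearly between `floor` and 1.0 by the segment's
    self-evaluation score. High-score segments keep the full advantage;
    low-score segments receive a smaller, more conservative update."""
    scaled = []
    for s in scores:
        s = min(max(s, 0.0), 1.0)  # clamp the score into [0, 1]
        scaled.append(advantage * (floor + (1.0 - floor) * s))
    return scaled
```

Under these assumptions, a rollout with advantage 1.0 and segment scores 1.0 and 0.0 would receive per-segment advantages of 1.0 and 0.5, so the segment judged unreliable still contributes to the update, but with half the weight.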

Reasoning & Chain-of-Thought | Recommendation & Information Retrieval | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
