Amazon Science | Jun 22, 2026 | arXiv:2606.23937

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

Tianyu Ding, Juan Pablo De la Cruz Weinstein

AI Summary

This study questions the use of exact-match retrieval recall as a proxy for policy utility in long-horizon tool-use agents, using Qwen2.5-3B/7B classifiers for pre-action policy classification in tau-bench. The authors compare classifiers conditioned on gold policy clauses with classifiers conditioned on the top-ranked retrieved clause. Although the exact governing clause is retrieved at rank 1 for only 7% of airline states, the primary 3B classifier reaches macro-F1 0.58 with retrieved clauses versus 0.60 with gold clauses, with a confidence interval too wide to establish non-inferiority. The findings suggest that exact-match recall can underestimate the downstream utility of retrieved policy context, and they motivate evaluating retrieved policies inside the classification loop rather than by recall alone.

Key Contribution

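Exact-match retrieval recall can underestimate policy utility: in tau-bench, retrieved clauses yield macro-F1 close to that of gold clauses (0.58 vs. 0.60) even though the exact governing clause is ranked first for only 7% of airline states, although the interval is too wide to establish non-inferiority.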

Abstract

Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B classifiers. Under gold-policy conditioning, a compact structured state improves macro-F1 over raw trajectories by 0.13-0.17 after tuning. We then replace the benchmark-designated policy clause with the top-ranked clause retrieved from decision-time context. Although the exact governing clause is retrieved at rank 1 for only 7% of airline states, the primary 3B classifier obtains macro-F1 0.58 with retrieved clauses versus 0.60 with gold clauses (Delta=-0.02, task-cluster 95% CI [-0.23,+0.21]); mismatched-policy and no-policy controls score 0.32 and 0.21. We do not detect a macro-F1 difference between retrieved and gold clauses in this configuration, although the interval remains too wide to establish non-inferiority. The same qualitative pattern appears with a second retriever and at 7B, while varying across fine-tuning configurations. These results indicate that exact-match clause recall can underestimate downstream policy utility in this benchmark setting, motivating evaluation with retrieved policies in the classification loop rather than recall alone.
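The quantities compared in the abstract (exact-match recall at rank 1, macro-F1, and a task-clustered interval on the macro-F1 difference) can be illustrated with a minimal sketch. This is not the authors' code: the record fields `task`, `gold`, `pred_retrieved`, and `pred_gold` are hypothetical names, and the percentile cluster bootstrap is an assumed procedure, since the abstract does not say how the task-cluster 95% CI was computed.

```python
import random
from collections import defaultdict


def recall_at_1(retrieved_ids, gold_ids):
    """Fraction of states whose top-ranked clause is exactly the gold clause."""
    hits = sum(1 for r, g in zip(retrieved_ids, gold_ids) if r == g)
    return hits / len(gold_ids)


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1; a class with no support scores 0."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def cluster_bootstrap_delta(records, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for macro-F1(retrieved) - macro-F1(gold).

    Whole tasks are resampled with replacement, so states from the same
    task stay together (assumed reading of "task-cluster" in the abstract).
    """
    by_task = defaultdict(list)
    for r in records:
        by_task[r["task"]].append(r)
    tasks = sorted(by_task)
    # Fix the label set so every resample averages over the same classes.
    labels = sorted(
        {r["gold"] for r in records}
        | {r["pred_retrieved"] for r in records}
        | {r["pred_gold"] for r in records}
    )
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        sample = [r for t in rng.choices(tasks, k=len(tasks)) for r in by_task[t]]
        y = [r["gold"] for r in sample]
        f1_ret = macro_f1(y, [r["pred_retrieved"] for r in sample], labels)
        f1_gold = macro_f1(y, [r["pred_gold"] for r in sample], labels)
        deltas.append(f1_ret - f1_gold)
    deltas.sort()
    lo = deltas[int((alpha / 2) * (n_boot - 1))]
    hi = deltas[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi
```

Under this reading, a non-inferiority claim would require the lower bound of the interval to sit above a pre-specified negative margin. With the reported interval of [-0.23, +0.21], any margin tighter than 0.23 is not met, which is consistent with the abstract's caveat.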

Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
