Independent Researcher · Jun 8, 2026 · arXiv:2606.10156

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Bharat Narasimhan, Bharath Sivaram Narasimhan, Karthik R. Narasimhan

AI Summary

The paper introduces $\tau$-Rec, a benchmark designed to evaluate agentic recommender systems through verifiable rewards and a reveal-tagged elicitation (RTE) mechanism, addressing the limitations of subjective evaluations prevalent in current benchmarks. By systematically testing nine configurations across five model families, the authors uncover a significant reliability gap, with the best-performing model achieving only ~57% at pass^1 and ~38% at pass^4. This work highlights the urgent need for improved evaluation methods as recommender systems evolve into more complex conversational agents.

Key Contribution

Even the top-performing configuration is unreliable on a new benchmark designed to test agentic recommender systems: it reaches only ~57% at pass^1 and ~38% at pass^4.

Abstract

As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $\tau$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $\tau$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
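The abstract does not spell out how pass^k is computed. A minimal sketch follows, assuming $\tau$-Rec uses the estimator popularized by $\tau$-bench: for a task with n independent trials of which c succeed, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), and the benchmark score is the mean over tasks. The function names and the per-task success counts below are hypothetical illustrations, not taken from the paper or its repository.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the probability that k trials of one task ALL succeed.

    Assumed estimator: C(c, k) / C(n, k), where n = num_trials and
    c = num_successes. math.comb returns 0 when k > c, so a task with
    fewer than k successes contributes 0.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = sum(pass_hat_k(num_trials, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical data: three tasks, four trials each, with 4, 2, and 0 successes.
example_successes = [4, 2, 0]
for k in (1, 2, 4):
    print(f"pass^{k} = {benchmark_pass_hat_k(example_successes, 4, k):.3f}")
```

Under this definition pass^k can only stay flat or fall as k grows, because a task counts fully toward pass^k only if the agent succeeds on every one of the k sampled trials. That is consistent with the reported drop from ~57% at pass^1 to ~38% at pass^4.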

Eval Frameworks & Benchmarks · Recommendation & Information Retrieval · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 26

Year: 2026

Venue: N/A
