Mar 4, 2026 · arXiv:2603.04370

τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik R. Narasimhan, Victor Barres

AI Summary

The paper introduces τ-Knowledge, an extension of τ-Bench, to evaluate conversational agents in knowledge-intensive settings requiring coordination of external knowledge and tool outputs. A new domain, τ-Banking, is presented, modeling fintech customer support workflows with approximately 700 interconnected knowledge documents and tool-mediated account updates. Experiments show that even advanced models struggle to retrieve relevant documents and reason over complex policies, achieving only ~25.5% pass rate, highlighting challenges in integrating unstructured knowledge for human-facing deployments.

Key Contribution

Even frontier models with high reasoning budgets fail to effectively navigate densely interlinked knowledge bases and complex policies in realistic fintech customer support scenarios, achieving only ~25.5% pass rate.

Abstract

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce τ-Knowledge, an extension of τ-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, τ-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only ~25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, τ-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
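The abstract reports pass^1 and notes that reliability degrades over repeated trials. As a rough illustration, the sketch below assumes the pass^k definition from the original τ-Bench paper: for each task with n trials and c successes, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function name `pass_hat_k` and the sample counts are hypothetical and are not taken from this paper.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k, assuming the tau-Bench style definition.

    successes_per_task: number of successful trials for each task.
    num_trials: trials run per task (n), assumed equal across tasks.
    k: number of trials that must all succeed.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")

    denominator = comb(num_trials, k)
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    per_task = [comb(c, k) / denominator for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Made-up example: 4 tasks, 4 trials each, with 4, 2, 0 and 1 successes.
example_successes = [4, 2, 0, 1]
for k in (1, 2, 4):
    print(k, pass_hat_k(example_successes, num_trials=4, k=k))
```

With k = 1 this reduces to the plain average success rate. Larger k only credits tasks that are solved consistently, which is why the score can fall quickly as k grows.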

Eval Frameworks & Benchmarks · Recommendation & Information Retrieval · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
