USC | Feb 16, 2026 | arXiv:2602.15197

OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape, Yuhao Zhang, Jack Hessel

AI Summary

The paper introduces OpaqueToolsBench, a benchmark designed to evaluate LLM agents' ability to learn and utilize tools with underspecified documentation through interaction. The benchmark includes three environments: general function calling, interactive chess, and agentic search, all featuring opaque tools with unclear usage guidelines. The authors propose ToolObserver, a framework that iteratively refines tool documentation by observing execution feedback, demonstrating superior performance and token efficiency compared to existing methods on the benchmark.

Key Contribution

LLMs can learn to use poorly documented tools more effectively by iteratively refining documentation based on observed execution feedback, challenging the assumption of perfectly documented tools in existing benchmarks.

Abstract

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
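The refinement loop described in the abstract can be illustrated with a toy sketch. This is not the authors' ToolObserver implementation: the names `ToolDoc`, `opaque_search`, `summarize_feedback`, and `refine` are hypothetical, the opaque tool is a stand-in whose hidden behavior (failing on queries longer than three words) is invented for illustration, and the step that would normally be an LLM call is replaced by a simple rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolDoc:
    """Documentation for one tool, plus notes learned from interaction."""
    name: str
    description: str
    notes: list = field(default_factory=list)

    def render(self) -> str:
        lines = [f"{self.name}: {self.description}"]
        lines += [f"- {note}" for note in self.notes]
        return "\n".join(lines)


def opaque_search(query: str) -> str:
    """Hypothetical opaque tool: silently fails on queries over 3 words."""
    if len(query.split()) > 3:
        return "ERROR: no results"
    return f"results for '{query}'"


def summarize_feedback(trajectory: list) -> Optional[str]:
    """Rule-based stand-in for an LLM that turns observations into a note."""
    successes = [q for q, out in trajectory if not out.startswith("ERROR")]
    failures = [q for q, out in trajectory if out.startswith("ERROR")]
    if successes and failures:
        limit = max(len(q.split()) for q in successes)
        return f"Queries of up to {limit} words returned results; longer ones failed."
    return None


def refine(doc: ToolDoc, tool: Callable[[str], str], batches: list) -> ToolDoc:
    """Run each batch of calls, observe the outputs, and append new notes."""
    for batch in batches:
        trajectory = [(query, tool(query)) for query in batch]
        note = summarize_feedback(trajectory)
        if note is not None and note not in doc.notes:
            doc.notes.append(note)
    return doc


doc = refine(
    ToolDoc("search", "Searches the web."),
    opaque_search,
    [
        ["chess openings", "best chess openings for new players"],
        ["tool use benchmarks", "how do agents learn tool behavior"],
    ],
)
print(doc.render())
```

Each pass adds what was observed to the documentation, so later calls can be conditioned on the rendered text rather than on the original underspecified description.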

Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
