This paper addresses poor tool retrieval performance in LLMs when instructions are vague, as real-world instructions tend to be, rather than overly specific, as in academic benchmarks. The authors introduce VGToolBench, a new benchmark designed to simulate vague human instructions, and find that vague instructions significantly degrade tool retrieval performance. To mitigate this, they propose Tool Retrieval Bridge (TRB), a method that uses a bridge model to rewrite vague instructions into more specific ones, aligning them with the retriever's preferences. With TRB, BM25 retrieval achieves a relative improvement of up to 111.51%.
LLMs struggle to retrieve the right tools when instructions are vague, but a simple "bridge model" that rewrites instructions into more specific ones can, in the best case, more than double BM25's NDCG score.
Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Because the set of available tools is large and irregularly updated, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. This discrepancy hinders tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate vague human instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To address this, we propose a simple yet effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model that rewrites vague instructions into more specific ones, narrowing the gap between vague instructions and retriever preferences. We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.
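The rewrite-then-retrieve idea behind TRB can be illustrated with a small sketch. This is not the authors' implementation: the tool descriptions are invented for illustration, `bridge_rewrite` is a hypothetical hard-coded stand-in for the trained bridge model, and the BM25 scorer is a minimal from-scratch version (Lucene-style IDF) rather than any particular library's API.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenizer; a real retriever would normalize more carefully.
    return text.lower().split()

class BM25:
    """Minimal BM25 scorer over a small list of tool descriptions."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        dl = len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            f = self.tf[i].get(t, 0)
            if f == 0:
                continue  # term absent from this document contributes nothing
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Indices of documents, best match first (ties keep original order).
        return sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)

# Hypothetical tool descriptions, invented for this example.
TOOLS = [
    "convert_currency amount from_currency to_currency returns converted amount",
    "search_flights origin destination date returns available flights",
    "get_weather_forecast city date returns temperature and precipitation forecast",
]
WEATHER = 2  # index of the tool the user actually needs

def bridge_rewrite(instruction):
    # Hypothetical stand-in for the bridge model: a fixed lookup table.
    # In TRB this step is performed by a model, not a dictionary.
    rewrites = {
        "will i need an umbrella tomorrow":
            "get weather forecast with precipitation for a city on a date",
    }
    return rewrites.get(instruction.lower(), instruction)

retriever = BM25(TOOLS)
vague = "Will I need an umbrella tomorrow"
vague_rank = retriever.rank(vague)
bridged_rank = retriever.rank(bridge_rewrite(vague))
print("vague:  ", vague_rank)
print("bridged:", bridged_rank)
```

In this toy setup the vague instruction shares no tokens with any tool description, so every BM25 score is zero and the ranking carries no signal. The rewritten instruction overlaps with the weather tool's description on several terms, which is the kind of lexical gap a bridge model is meant to close for a term-matching retriever.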