The paper introduces DICE-BENCH, a new benchmark for evaluating the tool-use capabilities of large language models (LLMs) in multi-round, multi-party dialogues, addressing the limitations of existing single-turn benchmarks. The authors also propose DICE-SCORE, a metric that quantifies how dispersed tool-related information is across a dialogue, and use it to show that current benchmarks concentrate this information in too few turns. Experiments on 19 LLMs with DICE-BENCH demonstrate that significant improvements are still needed before LLMs can be deployed effectively for real-world tool use.
Current function-calling benchmarks are too simple: DICE-BENCH reveals that LLMs still fail at realistic, multi-turn tool use.
Existing function-calling benchmarks focus on single-turn interactions, overlooking the complexity of real-world scenarios. To quantify how well existing benchmarks reflect practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information, such as function names and parameter values, throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas that enhances dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are publicly available: https://snuhcc.github.io/DICE-Bench/.
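The abstract does not give the DICE-SCORE formula, but the intuition (tool-related information scattered across many turns is harder than information packed into one turn) can be illustrated with a toy measure. The function below, its name, and its normalization are all hypothetical stand-ins for illustration, not the authors' actual metric: it scores 1.0 when every piece of tool information (function name, parameter values) sits in a different turn, and 1/n when all n pieces appear in a single turn.

```python
def toy_dispersion(turns, tool_tokens):
    """Toy dispersion sketch (NOT the paper's DICE-SCORE formula).

    turns: list of utterance strings making up the dialogue.
    tool_tokens: set of tool-related strings (function names, parameter values).
    Returns (distinct turns carrying tool info) / (tokens found):
    near 1/len(tool_tokens) when everything is packed into one turn
    (single-turn style), near 1.0 when the information is fully scattered.
    """
    carrying = set()  # indices of turns that carry some tool-related token
    found = 0         # how many tool tokens were located anywhere
    for tok in tool_tokens:
        for i, turn in enumerate(turns):
            if tok.lower() in turn.lower():
                carrying.add(i)
                found += 1
                break  # count each token's first occurrence only
    return len(carrying) / found if found else 0.0

# Scattered across three turns -> high dispersion
multi_turn = [
    "Can you book me a flight to Paris?",   # parameter value
    "June 3rd works best.",                 # parameter value
    "Okay, calling book_flight now.",       # function name
]
print(toy_dispersion(multi_turn, {"book_flight", "paris", "june 3rd"}))  # → 1.0

# Everything in one turn -> low dispersion
single_turn = ["Call book_flight to Paris on June 3rd."]
print(toy_dispersion(single_turn, {"book_flight", "paris", "june 3rd"}))
```

Under this toy measure, conventional single-turn benchmarks would score at the low end, which is the gap the abstract says DICE-BENCH's high-DICE-SCORE instances are built to fill.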