CUHK · Li Auto · Feb 27, 2025 · arXiv:2502.20073

Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Haochen Sun, Shuwen Zhang, Lei Ren, Hao Xu, Hao Fu, Caixia Yuan, Xiaojie Wang

AI Summary

This paper introduces Collab-Overcooked, a new benchmark built on the Overcooked-AI game, designed to evaluate the collaborative capabilities of LLM-based Multi-Agent Systems (LLM-MAS) in interactive environments. The benchmark features diverse tasks and objectives, encouraging collaboration through natural language communication, and introduces process-oriented evaluation metrics to assess fine-grained collaboration skills. Experiments with 13 popular LLMs reveal strengths in goal interpretation but weaknesses in active collaboration and continuous adaptation, highlighting areas for improvement in LLM-MAS.

Key Contribution

LLMs struggle to actively collaborate and continuously adapt in complex, interactive environments, despite showing proficiency in goal interpretation.

Abstract

Large Language Model (LLM)-based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework that supports diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, they have significant shortcomings in active collaboration and continuous adaptation, both of which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.

Eval Frameworks & Benchmarks · Robotics & Embodied AI · Tool Use & Agents

Citation Metrics

Citations: 11

Influential citations: 0

References: 45

Year: 2025

Venue: Conference on Empirical Methods in Natural Language Processing
