The paper introduces LLM-Hanabi, a new benchmark for evaluating Theory-of-Mind (ToM) capabilities in Large Language Models (LLMs) within the context of the cooperative game Hanabi. The benchmark includes an automated evaluation system that measures both game performance and ToM proficiency, specifically focusing on first-order and second-order ToM. Results demonstrate a positive correlation between ToM and in-game success, with first-order ToM exhibiting a stronger correlation than second-order ToM, suggesting its greater importance for effective AI collaboration.
LLMs playing Hanabi reveal that accurately guessing your partner's intentions matters more than predicting what they *think* you're thinking.
Effective multi-agent collaboration requires agents to infer the rationale behind others' actions, a capability rooted in Theory-of-Mind (ToM). While recent Large Language Models (LLMs) excel at logical inference, their ability to infer rationale in dynamic, collaborative settings remains under-explored. This study introduces LLM-Hanabi, a novel benchmark that uses the cooperative game Hanabi to evaluate the rationale inference and ToM of LLMs. Our framework features an automated evaluation system that measures both game performance and ToM proficiency. Across a range of models, we find a significant positive correlation between ToM and in-game success. Notably, first-order ToM (interpreting others' intent) correlates more strongly with performance than second-order ToM (predicting others' interpretations). These findings highlight that for effective AI collaboration, the ability to accurately interpret a partner's rationale is more critical than higher-order reasoning. We conclude that prioritizing first-order ToM is a promising direction for enhancing the collaborative capabilities of future models.
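To make the headline analysis concrete, here is a minimal Python sketch of the kind of correlation study the abstract describes: per-model first-order and second-order ToM scores correlated against mean game score. All names and numbers below are illustrative placeholders, not the paper's actual data or pipeline.

```python
# Hypothetical sketch of the ToM-vs-performance correlation analysis.
# The variable names (tom1, tom2, score) and all values are placeholders;
# the real LLM-Hanabi evaluation system is not reproduced here.
from statistics import correlation  # Pearson's r, available in Python 3.10+

# Per-model results: (first-order ToM accuracy, second-order ToM accuracy,
# mean Hanabi game score). Placeholder numbers for illustration only.
results = {
    "model_a": (0.82, 0.61, 18.4),
    "model_b": (0.74, 0.55, 15.9),
    "model_c": (0.63, 0.48, 12.1),
    "model_d": (0.55, 0.40, 9.7),
}

tom1 = [r[0] for r in results.values()]
tom2 = [r[1] for r in results.values()]
score = [r[2] for r in results.values()]

# The paper's finding predicts r(ToM-1, score) > r(ToM-2, score):
# interpreting a partner's intent tracks game success more closely
# than predicting the partner's interpretation of your own hints.
print(f"r(ToM-1, score) = {correlation(tom1, score):.3f}")
print(f"r(ToM-2, score) = {correlation(tom2, score):.3f}")
```

In this framing, each model's ToM scores would come from the benchmark's automated grading of its stated inferences during play, and the comparison of the two coefficients is what supports the claim that first-order ToM is the stronger predictor.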