This paper introduces Qi Town, a novel adversarial benchmarking framework for evaluating LLMs' strategic reasoning and mental fitness through board game competitions, addressing the limitations of Q&A-based benchmarks. The framework supports 5 games, involves 20 LLM players, and employs Elo ratings and a Performance Loop Graph (PLG) to assess technical capabilities, alongside a Positive Sentiment Score (PSS) to gauge mental fitness. Experiments reveal that LLMs remain optimistic regardless of whether they win or lose, but also exhibit instability in skilled play, as highlighted by cyclic wins and losses in PLGs.
LLMs maintain a positive attitude even when losing in adversarial board games, yet their gameplay reveals surprising instability in skill execution.
Adversarial board games, as a paradigmatic domain of strategic reasoning and intelligence, have long served as both a popular competitive activity and a benchmark for evaluating artificial intelligence (AI) systems. Building on this foundation, we propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board game competitions, compensating for the data-dependency limitation of mainstream Question-and-Answer (Q&A) based benchmark methods. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players. The platform employs both the Elo rating system and a novel Performance Loop Graph (PLG) to quantitatively evaluate the technical capabilities of LLMs, while also capturing a Positive Sentiment Score (PSS) throughout gameplay to assess mental fitness. The evaluation is structured as a round-robin tournament, enabling systematic comparison across players. Experimental results indicate that, despite technical differences, most LLMs remain optimistic about winning and losing, demonstrating greater adaptability to high-stress adversarial environments than humans. On the other hand, the complex relationship between cyclic wins and losses in PLGs exposes the instability of LLMs' skill play during games, warranting further explanation and exploration.
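For readers unfamiliar with the rating scheme, the standard Elo update used in such round-robin evaluations can be sketched as follows. This is a minimal illustration of the generic Elo formula, not the paper's exact implementation; the K-factor of 32 and starting rating of 1500 are common defaults assumed here, since the paper's parameters are not given in the abstract.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two players' Elo ratings after one game.

    score_a is 1.0 if player A wins, 0.5 for a draw, 0.0 for a loss.
    The K-factor of 32 is a conventional default, assumed here.
    """
    # Expected score of A against B under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    # Rating change is zero-sum between the two players.
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated players (1500 each); A wins, gaining 16 points.
new_a, new_b = elo_update(1500, 1500, 1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Because the update is zero-sum, total rating across all players is conserved over a round-robin tournament, which makes the final ratings directly comparable across the 20 LLM players.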