This paper introduces a text-based Clue environment to evaluate multi-step deductive reasoning in LLMs. Six agents based on GPT-4o-mini and Gemini-2.5-Flash were tested, revealing that both models struggle to maintain consistent deductive reasoning throughout a game. Fine-tuning on structured logic puzzles failed to reliably improve performance and in some cases decreased reasoning precision.
LLMs can't crack Clue: even state-of-the-art models struggle with multi-step deductive reasoning in a simulated text-based game, and fine-tuning doesn't reliably help.
Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.
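The core skill the benchmark probes can be illustrated with the bookkeeping a Clue agent must maintain: every card revealed during play rules out a candidate, and the solution is deduced once each category has exactly one candidate left. The sketch below is a minimal, hypothetical illustration of this elimination logic; the card names and class design are illustrative assumptions, not details from the paper's environment.

```python
# Hypothetical sketch of elimination-based deduction in a Clue-style game.
# Card names and structure are illustrative, not taken from the paper.

SUSPECTS = {"Scarlett", "Mustard", "Plum"}
WEAPONS = {"Rope", "Knife", "Pistol"}
ROOMS = {"Kitchen", "Library", "Study"}


class DeductionTracker:
    """Tracks which cards are still candidates for the hidden solution.

    Any card observed in a player's hand cannot be in the envelope;
    a category with one remaining candidate is fully deduced.
    """

    def __init__(self):
        self.candidates = {
            "suspect": set(SUSPECTS),
            "weapon": set(WEAPONS),
            "room": set(ROOMS),
        }

    def observe_card(self, card):
        # Eliminate the shown card from every category it appears in.
        for pool in self.candidates.values():
            pool.discard(card)

    def solution(self):
        # Return the deduced solution only once every category is pinned down.
        if all(len(pool) == 1 for pool in self.candidates.values()):
            return {k: next(iter(pool)) for k, pool in self.candidates.items()}
        return None


tracker = DeductionTracker()
for shown in ["Scarlett", "Plum", "Rope", "Pistol", "Kitchen", "Study"]:
    tracker.observe_card(shown)
print(tracker.solution())
# → {'suspect': 'Mustard', 'weapon': 'Knife', 'room': 'Library'}
```

This elimination step is trivial to implement in code; the paper's finding is that LLM agents, reasoning over the same evidence in natural language across many turns, fail to apply it consistently.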