This paper introduces a text-based Clue environment to evaluate multi-step deductive reasoning in LLMs. Six agents based on GPT-4o-mini and Gemini-2.5-Flash were tested, revealing that both models struggle to maintain consistent deductive reasoning throughout a game. Fine-tuning on structured logic puzzles failed to reliably improve performance and in some cases decreased reasoning precision.
LLMs can't crack Clue: even state-of-the-art models struggle with multi-step deductive reasoning in a simulated text-based game, and fine-tuning doesn't reliably help.
Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.
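The core skill the benchmark probes can be illustrated with the bookkeeping a Clue agent must maintain: every card revealed during play rules out a candidate, and the solution is deduced once each category has exactly one candidate left. The sketch below is a minimal, hypothetical illustration of this elimination logic; the card names and class design are illustrative assumptions, not details from the paper's environment.

```python
# Hypothetical sketch of elimination-based deduction in a Clue-style game.
# Card names and structure are illustrative, not taken from the paper.

SUSPECTS = {"Scarlett", "Mustard", "Plum"}
WEAPONS = {"Rope", "Knife", "Pistol"}
ROOMS = {"Kitchen", "Library", "Study"}


class DeductionTracker:
    """Tracks which cards are still candidates for the hidden solution.

    Any card observed in a player's hand cannot be in the envelope;
    a category with one remaining candidate is fully deduced.
    """

    def __init__(self):
        self.candidates = {
            "suspect": set(SUSPECTS),
            "weapon": set(WEAPONS),
            "room": set(ROOMS),
        }

    def observe_card(self, card):
        # Eliminate the shown card from every category it appears in.
        for pool in self.candidates.values():
            pool.discard(card)

    def solution(self):
        # Return the deduced solution only once every category is pinned down.
        if all(len(pool) == 1 for pool in self.candidates.values()):
            return {k: next(iter(pool)) for k, pool in self.candidates.items()}
        return None


tracker = DeductionTracker()
for shown in ["Scarlett", "Plum", "Rope", "Pistol", "Kitchen", "Study"]:
    tracker.observe_card(shown)
print(tracker.solution())
# → {'suspect': 'Mustard', 'weapon': 'Knife', 'room': 'Library'}
```

This elimination step is trivial to implement in code; the paper's finding is that LLM agents, reasoning over the same evidence in natural language across many turns, fail to apply it consistently.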