Feb 19, 2026 | arXiv:2602.17443

AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

Adib Sakhawat, Fardeen Sadab, Rakin Shahriar

AI Summary

The paper introduces AIDG, a game-theoretic framework with two tasks (AIDG-I and AIDG-II) for evaluating the asymmetry between information extraction (deduction) and information containment in multi-turn dialogue with LLMs. Across 439 games with six LLMs, the authors find a large performance gap: models do well at containment (defense) but struggle with deduction (active inquiry), a gap quantified as a 350-point Elo advantage on defense. The study identifies information dynamics and constraint adherence as the two bottlenecks behind this asymmetry, and attributes the deductive weakness to the global state tracking that strategic inquiry requires.

Key Contribution

LLMs perform markedly worse at strategic deduction than at information containment in multi-turn dialogue, revealing an asymmetry in their reasoning abilities.

Abstract

Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350-point Elo advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
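To give a sense of scale for the two effect measures quoted in the abstract, the sketch below computes an expected win probability from a rating gap and a Cohen's d from two samples. It is illustrative only: the abstract does not say how the ratings were computed, so the standard logistic Elo formula with a scale of 400 is an assumption, and the function names `expected_score` and `cohens_d` are hypothetical rather than taken from the paper.

```python
import math
import statistics


def expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    Assumption: the conventional logistic formula with scale 400; the paper
    may use a different rating procedure.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d for two independent samples, using the pooled sample SD."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


if __name__ == "__main__":
    # A 350-point gap under the assumed formula.
    print(f"expected score at +350: {expected_score(350):.3f}")
    # Toy samples, not data from the paper.
    print(f"toy Cohen's d: {cohens_d([2, 4, 6], [1, 3, 5]):.2f}")
```

Under this assumed formula, a 350-point gap corresponds to an expected score of roughly 0.88 for the higher-rated side. The toy samples in the usage lines are made up for illustration and do not reproduce the reported d = 5.47.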

Eval Frameworks & Benchmarks | Natural Language Processing | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
