Mar 31, 2026arXiv:2603.29761

Tracking vs. Deciding: The Dual-Capability Bottleneck in Searchless Chess Transformers

AI Summary

This paper investigates the limitations of training chess transformers from move sequences, identifying a "dual-capability bottleneck" where performance is constrained by the weaker of state tracking and decision quality. They show that low-rated games are crucial for learning state tracking, while high-rated games are essential for decision quality, and that simply scaling model size is insufficient. To overcome this, they scale the model to 120M parameters and introduce Elo-weighted training, achieving a Lichess bullet rating of 2570 and surpassing Maia-2 in human move prediction accuracy.

Key Contribution

Chess transformers trained solely on move sequences face a "dual-capability bottleneck" where excelling at both state tracking and decision-making requires carefully balancing data diversity and quality, a tension that simple scaling cannot resolve.

Abstract

A human-like chess engine should mimic the style, errors, and consistency of a strong human player rather than maximize playing strength. We show that training from move sequences alone forces a model to learn two capabilities: state tracking, which reconstructs the board from move history, and decision quality, which selects good moves from that reconstructed state. These impose contradictory data requirements: low-rated games provide the diversity needed for tracking, while high-rated games provide the quality signal for decision learning. Removing low-rated data degrades performance. We formalize this tension as a dual-capability bottleneck, P <= min(T,Q), where overall performance is limited by the weaker capability. Guided by this view, we scale the model from 28M to 120M parameters to improve tracking, then introduce Elo-weighted training to improve decisions while preserving diversity. A 2 x 2 factorial ablation shows that scaling improves tracking, weighting improves decisions, and their combination is superadditive. Linear weighting works best, while overly aggressive weighting harms tracking despite lower validation loss. We also introduce a coverage-decay formula, t* = log(N/kcrit)/log b, as a reliability horizon for intra-game degeneration risk. Our final 120M-parameter model, without search, reached Lichess bullet 2570 over 253 rated games. On human move prediction it achieves 55.2% Top-1 accuracy, exceeding Maia-2 rapid and Maia-2 blitz. Unlike position-based methods, sequence input naturally encodes full game history, enabling history-dependent decisions that single-position models cannot exhibit.

Architecture Design (Transformers, SSMs, MoE)Data Curation & Synthetic Data Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...