The paper introduces MathDuels, a self-play benchmark where LLMs generate math problems adversarially and solve problems created by other models. This dual-role evaluation reveals that authoring and solving mathematical problems are partially decoupled skills in LLMs, uncovering performance differences not apparent in traditional, solver-only benchmarks. Experiments across 19 frontier models demonstrate the benchmark's ability to dynamically adapt difficulty and differentiate model capabilities as new, stronger models emerge.
LLMs that ace math exams can still be stumped by problems crafted by other LLMs, revealing a surprising gap between solving and problem-posing abilities.
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models show that authoring and solving capabilities are partially decoupled, and that dual-role evaluation exposes capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
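To make the scoring step concrete, here is a minimal sketch of a Rasch fit, in which the probability that solver i answers problem j correctly is sigmoid(theta_i - b_j), with theta_i the solver ability and b_j the problem difficulty. The abstract does not describe the estimation procedure, so the gradient-ascent fit, the small L2 penalty (used here to pin down the scale and keep perfect records finite), the mean-difficulty definition of author quality, the function names, and the toy outcomes are all illustrative assumptions rather than the paper's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(outcomes, n_solvers, n_problems, lr=0.1, steps=2000, l2=0.01):
    """Fit abilities (theta) and difficulties (b) by penalized gradient ascent.

    outcomes: list of (solver, problem, correct) with correct in {0, 1}.
    Assumption: the optimizer and the L2 penalty are choices made for this
    sketch; the source only states that a Rasch model is used.
    """
    theta = [0.0] * n_solvers
    b = [0.0] * n_problems
    for _ in range(steps):
        # Gradient of the L2 penalty, then add the log-likelihood terms.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for s, p, y in outcomes:
            resid = y - sigmoid(theta[s] - b[p])
            g_theta[s] += resid  # d logL / d theta_s
            g_b[p] -= resid      # d logL / d b_p
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    return theta, b


def author_quality(b, problem_author, n_authors):
    """Assumption: author quality = mean difficulty of the author's problems."""
    sums = [0.0] * n_authors
    counts = [0] * n_authors
    for p, a in enumerate(problem_author):
        sums[a] += b[p]
        counts[a] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Toy arena (made-up data): 3 models, each authors one problem and
# attempts only the problems written by the other two.
problem_author = [0, 1, 2]
outcomes = [
    (0, 1, 1), (0, 2, 0),  # model 0 solves problem 1, fails problem 2
    (1, 0, 0), (1, 2, 0),  # model 1 fails both
    (2, 0, 1), (2, 1, 1),  # model 2 solves both
]
theta, b = fit_rasch(outcomes, n_solvers=3, n_problems=3)
quality = author_quality(b, problem_author, n_authors=3)
```

In this toy data, model 2 ends up with the highest ability and model 1 the lowest, while problem 2 (never solved) gets the highest difficulty and problem 1 (always solved) the lowest. Because each model receives a solver score and an author score from the same fit, the two rankings can be compared directly.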