Apr 23, 2026 | arXiv:2604.21916

MathDuels: Evaluating LLMs as Problem Posers and Solvers

Zhiqiu Xu, Shibo Jin, Shreyash Arya, Mayur Naik

AI Summary

The paper introduces MathDuels, a self-play benchmark where LLMs generate math problems adversarially and solve problems created by other models. This dual-role evaluation reveals that authoring and solving mathematical problems are partially decoupled skills in LLMs, uncovering performance differences not apparent in traditional, solver-only benchmarks. Experiments across 19 frontier models demonstrate the benchmark's ability to dynamically adapt difficulty and differentiate model capabilities as new, stronger models emerge.

Key Contribution

LLMs that ace math exams can still be stumped by problems crafted by other LLMs, revealing a surprising gap between solving and problem-posing abilities.

Abstract

As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification) and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models show that authoring and solving capabilities are partially decoupled, and that dual-role evaluation exposes capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
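To make the scoring step concrete, here is a minimal sketch of a Rasch fit, in which the probability that solver i solves problem j is sigmoid(theta_i - b_j), with theta_i the solver's ability and b_j the problem's difficulty. This is an illustration under stated assumptions, not the paper's implementation: the gradient-ascent fitting procedure, the small L2 penalty (added here only to keep estimates finite on tiny data), the use of the mean authored-problem difficulty as author quality, and every name in the code (`fit_rasch`, `author_quality`, the toy outcomes) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(outcomes, n_solvers, n_problems, lr=0.1, steps=2000, l2=0.01):
    """Fit abilities (theta) and difficulties (b) by gradient ascent.

    outcomes: list of (solver_index, problem_index, solved) with solved in {0, 1}.
    The L2 penalty is an assumption of this sketch, not taken from the paper.
    """
    theta = [0.0] * n_solvers
    b = [0.0] * n_problems
    for _ in range(steps):
        # Start each gradient from the penalty term, then add data terms.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for i, j, solved in outcomes:
            resid = solved - sigmoid(theta[i] - b[j])
            g_theta[i] += resid  # a solve raises the solver's ability
            g_b[j] -= resid      # a solve lowers the problem's difficulty
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    return theta, b


def author_quality(b, authors, n_models):
    """Mean difficulty of each model's authored problems (assumed aggregation)."""
    totals = [0.0] * n_models
    counts = [0] * n_models
    for problem, author in enumerate(authors):
        totals[author] += b[problem]
        counts[author] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy arena: three models; problem k is authored by model k, and each model
# attempts only the problems written by the other two.
authors = [0, 1, 2]
outcomes = [
    (0, 1, 1), (0, 2, 1),  # model 0 solves both problems it faces
    (1, 0, 0), (1, 2, 1),  # model 1 fails model 0's problem, solves model 2's
    (2, 0, 0), (2, 1, 0),  # model 2 solves neither
]
theta, b = fit_rasch(outcomes, n_solvers=3, n_problems=3)
quality = author_quality(b, authors, n_models=3)
```

In this toy arena, model 0 ranks highest both as a solver and as an author because its problem goes unsolved. In general the two rankings need not agree, which is the decoupling the abstract describes.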

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 51

Year: 2026

Venue: N/A
