Oxford | Mar 16, 2026 | arXiv:2603.15617

HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath, Charles London, Kalyan Ramakrishnan, Flaviu Cipcigan, Philip Torr, Alessandro Abate

AI Summary

The paper introduces HorizonMath, a benchmark of over 100 unsolved problems in computational and applied mathematics designed to evaluate AI's ability to make progress on open mathematical problems. The key innovation is the focus on problems where discovery is difficult but verification is computationally efficient, enabling automated evaluation and mitigating data contamination. Using HorizonMath, the authors found that GPT 5.4 Pro proposed solutions that improve upon the best-known published results for two problems, suggesting potential novel mathematical contributions.

Key Contribution

GPT 5.4 Pro may have made novel mathematical contributions, outperforming published results on two unsolved problems, as measured by the new HorizonMath benchmark.

Abstract

Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, where correct solutions to problems in the unsolved problem classes could constitute novel results in the mathematical literature.
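The "hard to discover, cheap to verify" property described in the abstract can be illustrated with a toy verifier. The sketch below is not taken from the HorizonMath framework, and the problem, function names, and baseline value are all assumptions chosen for illustration. It checks a candidate Sidon set (a set of integers whose pairwise sums are all distinct): finding large Sidon sets is difficult, but checking a proposed one takes a simple quadratic-time pass.

```python
def is_sidon_set(candidate):
    """Return True if all pairwise sums a + b (with a <= b) are distinct.

    Hypothetical helper for illustration; not part of HorizonMath.
    """
    items = sorted(set(candidate))
    if len(items) != len(candidate):
        return False  # repeated elements are not allowed
    seen = set()
    for i, a in enumerate(items):
        for b in items[i:]:
            s = a + b
            if s in seen:
                return False  # two different pairs share the same sum
            seen.add(s)
    return True


def improves_on_baseline(candidate, limit, baseline_size):
    """Accept a candidate only if it is valid and strictly larger than the baseline.

    `limit` and `baseline_size` are assumed inputs: the allowed range 1..limit
    and the size of the best previously known construction.
    """
    in_range = all(isinstance(x, int) and 1 <= x <= limit for x in candidate)
    return in_range and is_sidon_set(candidate) and len(candidate) > baseline_size


if __name__ == "__main__":
    proposal = [1, 2, 5, 7]
    print(is_sidon_set(proposal))
    print(improves_on_baseline(proposal, limit=7, baseline_size=3))
```

The verifier never needs to know how the candidate was found, which is what would let an evaluation of this kind run automatically without formal proofs or manual review.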

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
