Axiomatic AIColumbiaCôte d’Azur UniversityENSEIRB-MATMECAEPFLImperialMathThe London School of Geometry and NumberUofTUT AustinUvAMar 3, 2026arXiv:2603.02668

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman

AI Summary

The paper introduces SorryDB, a dynamically updated benchmark of Lean theorem proving tasks extracted from real-world formalization projects on GitHub, designed to address the limitations of static benchmarks and better reflect community needs. They evaluate a range of AI theorem proving approaches, including LLMs, agentic systems, and symbolic provers, on a snapshot of 1000 SorryDB tasks. Their results demonstrate that while a Gemini Flash-based agent achieves the best overall performance, it does not consistently outperform other methods, highlighting the complementary strengths of different approaches.

Key Contribution

Forget toy problems: SorryDB offers a continuously updated stream of real-world Lean formalization tasks, providing a robust benchmark for AI provers to contribute to novel mathematics.

Abstract

We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Related Papers