The authors introduce Goedel-Prover, an open-source language model for automated formal proof generation, trained via supervised fine-tuning on a newly created dataset of 800K formal proofs (Goedel-Pset-v1-solved). This dataset was generated by iteratively training provers and adding newly proven statements to the training set. Goedel-Prover achieves state-of-the-art performance, surpassing DeepSeek-Prover-V1.5-RL on miniF2F with a 57.6% success rate and solving 7 problems on PutnamBench, and further RL training improves the success rate to over 60% on miniF2F.
Goedel-Prover, a fully open-source model, surpasses prior state-of-the-art provers in automated theorem proving: with supervised fine-tuning alone (no RL) it reaches a 57.6% success rate on miniF2F and solves 7 problems on PutnamBench, and further RL training pushes miniF2F past 60%.
We introduce Goedel-Prover, an open-source language model that achieves state-of-the-art (as of April 5, 2025) performance in automated formal proof generation for mathematical problems. A key challenge in this field is the scarcity of formalized mathematical statements and proofs, which we address as follows. First, we train LLMs to convert natural language math problems from the Numina dataset into equivalent formal statements in Lean 4, yielding the dataset Goedel-Pset-v1 of 1.64 million formal statements. Next, we build a large dataset of formal proofs by training a series of provers: each new prover proves many statements that previous ones could not, and these new proofs are added to the training set for the next prover. The result is the dataset Goedel-Pset-v1-solved, which contains proofs for over 800K statements from Goedel-Pset-v1. Supervised fine-tuning (SFT) of DeepSeek-Prover-V1.5-Base on Goedel-Pset-v1-solved (i.e., no RL) yields Goedel-Prover-SFT, which achieves a success rate of 57.6% (Pass@32) on miniF2F, surpassing the previous leader DeepSeek-Prover-V1.5-RL (trained with SFT + RL on a proprietary dataset) by 7.6 percentage points. On PutnamBench, Goedel-Prover-SFT solves 7 problems (Pass@512), ranking first on the leaderboard. Further RL training (including DPO) improves the success rate to over 60% (Pass@32) on miniF2F. To aid future research, we provide extensive discussion of our training methodology and the key design choices behind Goedel-Prover's strong performance, and we fully open-source our code, models, and datasets. We also open-source formal proofs for 29.7K problems in Lean Workbook, nearly doubling the 15.7K solved by prior provers.
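The iterative data-expansion loop described above (train a prover, attempt the still-unproven statements, fold newly verified proofs back into the training set) can be sketched as follows. This is a minimal toy illustration, not the authors' pipeline: `train_prover` and its "capability threshold" stand in for real fine-tuning, and `proof-of-{s}` stands in for a Lean-verified proof.

```python
# Toy sketch of the expert-iteration loop from the abstract.
# train_prover is a hypothetical placeholder: in the real system it would
# fine-tune an LLM on all (statement, proof) pairs collected so far, and
# proof attempts would be checked by the Lean 4 compiler.

def train_prover(solved):
    """Return a 'prover' whose reach grows with the amount of solved data."""
    # Placeholder model of improvement: more training data lets the
    # prover handle statements up to a higher difficulty index.
    capability = 2 + len(solved)
    return lambda statement_id: statement_id < capability

def expert_iteration(statements, rounds=3):
    solved = {}  # statement -> verified proof (here just a marker string)
    for _ in range(rounds):
        prover = train_prover(solved)
        for s in statements:
            # Only attempt statements no earlier prover has solved.
            if s not in solved and prover(s):
                solved[s] = f"proof-of-{s}"  # would be a checked Lean proof
    return solved

# Each round, the retrained prover solves statements the previous one missed.
proofs = expert_iteration(range(10), rounds=3)
print(len(proofs))  # → 6
```

The key property the sketch captures is that the training set and the prover's reach grow together: statements solved in round *k* become supervision for the round *k*+1 prover.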