Goedel-LM/Goedel-Prover-SFT

linyongver committed on Jan 28

Commit fad47d4 (verified) · 1 Parent(s): 5e1e7cd

Update README.md

README.md +51 -3

@@ -1,3 +1,51 @@
----
-license: mit
----

+---
+license: mit
+---
+# Goedel-Prover: Pushing the Limit of Automated Theorem Proving Through Large Scale Data Synthesizing
+## 1. Introduction
+We introduce Goedel-Prover-SFT.
+<p align="center">
+  <img width="100%" src="figures/performance.png">
+</p>
+(Left) Pass@32 performance for full-proof generation on miniF2F. Due to limited compute, we compare with DeepSeek-Prover-V1.5 on the Pass@32 metric (Table 1 of Xin et al.), which differs from the Pass@32\*6400 setting in Fig. 1 of Xin et al. The Pass@N metric means that we generate N proofs for a single problem; if any one of these N proofs is correct, the problem is considered solved. (Middle) A comparison of Goedel-Prover-SFT and DeepSeek-Prover-V1.5 on miniF2F across different inference budgets, ranging from Pass@32, 64, 128, ..., 4\*6400. (Right) The number of Lean-workbook problems solved by Goedel-Prover-SFT compared to existing works. InternLM2.5-Step-Prover and InternLM-Math-Plus collectively solve and open-source 16K samples, while we solve and open-source 29.7K samples. For a more detailed comparison of Goedel-Prover-SFT, DeepSeek-Prover-V1.5-RL, and InternLM2.5-Step-Prover, please refer to the Appendix of our paper.
+## 2. Evaluation Results
+<div align="center">
+| Model | Compute (Pass) | miniF2F-test |
+|------------------------|------------------|------------------|
+| TheoremLlama | 128 | 33.6% |
+| DeepSeek-Prover-V1 | 32 | 46.1% |
+| DeepSeek-Prover-V1.5-SFT | 32 | 48.2% |
+| DeepSeek-Prover-V1.5-RL | 32 | 50.0% |
+| **Goedel-Prover-SFT** | **32** | **57.6%** |
+|------------------------|------------------|------------------|
+| DeepSeek-Prover-V1.5-SFT | 3200 | 53.3% |
+| DeepSeek-Prover-V1.5-RL | 3200 | 54.9% |
+| **Goedel-Prover-SFT** | **3200** | **62.7%** |
+|------------------------|------------------|------------------|
+| DeepSeek-Prover-V1.5-SFT | 25600 | 55.8% |
+| DeepSeek-Prover-V1.5-RL | 25600 | 58.5% |
+| **Goedel-Prover-SFT** | **25600** | **64.7%** |
+</div>
+<div align="center">
+MultiDataset
+| Model | miniF2F | ProofNet | Lean-workbook | Our Held-out |
+|------------------------|------------------|------------------|------------------|------------------|
+</div>
+<div align="center">
+Putnam
+| Rank | Type | Num-solved | Compute (Pass) |
+|------------------------|------------------|------------------|------------------|
+</div>
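
The Pass@N definition given in the figure caption (a problem counts as solved if any of its N generated proofs is correct) can be illustrated with a small Python sketch. This is not the evaluation code behind the reported numbers: the function name `pass_at_n`, the layout of `results`, and the toy data are all assumptions made for illustration, and in practice each boolean would come from checking a generated proof with the Lean compiler.

```python
from typing import Mapping, Sequence


def pass_at_n(results: Mapping[str, Sequence[bool]], n: int) -> float:
    """Fraction of problems with at least one correct proof among the first n attempts.

    `results` maps a problem id to per-attempt outcomes (True = proof verified).
    This layout is hypothetical and chosen only for the example.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not results:
        raise ValueError("results must contain at least one problem")
    # A problem is solved under Pass@N if any of its first n attempts succeeded.
    solved = sum(1 for attempts in results.values() if any(attempts[:n]))
    return solved / len(results)


# Toy outcomes for four made-up problems with four attempts each.
toy_results = {
    "problem_a": [False, True, False, False],
    "problem_b": [False, False, False, False],
    "problem_c": [False, False, False, True],
    "problem_d": [True, False, False, False],
}

print(pass_at_n(toy_results, 1))  # 0.25
print(pass_at_n(toy_results, 2))  # 0.5
print(pass_at_n(toy_results, 4))  # 0.75
```

Because extra attempts can only add solved problems, the score never decreases as N grows, which is why results at different compute budgets (32, 3200, 25600) are reported in separate groups.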