Update README.md
Browse files
README.md
CHANGED
@@ -1,3 +1,51 @@
|
|
1 |
-
---
|
2 |
-
license: mit
|
3 |
-
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
license: mit
|
3 |
+
---
|
4 |
+
# Godel-Prover: Pushing the Limit of Automated Theorem Proving Through Large Scale Data Synthesizing
|
5 |
+
|
6 |
+
## 1. Introduction
|
7 |
+
|
8 |
+
We introduce Godel-Prover-SFT.
|
9 |
+
|
10 |
+
<p align="center">
|
11 |
+
<img width="100%" src="figures/performance.png">
|
12 |
+
</p>
|
13 |
+
|
14 |
+
(Left) The performance of Pass@32 for full proof generation on miniF2F. Due to limited compute, we compare with DeepSeek-Prover-v1.5 on the Pass@32 metric (Table 1 of Xin et.al., ), which is different from Pass@32\*6400 in Fig. 1 of Xin et.al., The Pass@N metric indicates that we generate N proofs for a single problem; if any one of these N proofs successfully solves the problem, it is considered solved. (Middle) This sub-figure presents a comparison of Godel-Prover-SFT and Deepseek-Prover-v1.5 in terms of miniF2F performance across different inference budgets, ranging from Pass@32, 64, 128, ..., 4\*6400. (Right) The number of problems solved in Lean-workbook by Godel-Prover-SFT compared to existing works. InternLM2.5-Step-Prover and InternLM-Math-Plus collectively solve and open-source 16K samples, while we solve and open-source 29.7K samples. For a more detailed discussion on the comparison between Godel-Prover-SFT, Deepseek-Prover-v1.5-RL, and InternLM2.5-Step-Prover, please refer to Appendix of our paper.
|
15 |
+
|
16 |
+
## 2. Evaluation Results
|
17 |
+
|
18 |
+
<div align="center">
|
19 |
+
|
20 |
+
| Model |Compute (Pass)| miniF2F-test |
|
21 |
+
|------------------------|------------------|------------------|
|
22 |
+
| TheoremLamma | 128 | 33.6% |
|
23 |
+
| DeepSeek-Prover-V1 | 32 | 46.1% |
|
24 |
+
| DeepSeek-Prover-V1.5-SFT | 32 | 48.2% |
|
25 |
+
| DeepSeek-Prover-V1.5-RL | 32 | 50.0% |
|
26 |
+
| **Godel-Prover-SFT** | **32** | **57.6%** |
|
27 |
+
|------------------------|------------------|------------------|
|
28 |
+
| DeepSeek-Prover-V1.5-SFT | 3200 | 53.3% |
|
29 |
+
| DeepSeek-Prover-V1.5-RL | 3200 | 54.9% |
|
30 |
+
| **Godel-Prover-SFT** | **3200** | **62.7%** |
|
31 |
+
|------------------------|------------------|------------------|
|
32 |
+
| DeepSeek-Prover-V1.5-SFT | 25600 | 55.8% |
|
33 |
+
| DeepSeek-Prover-V1.5-RL | 25600 | 58.5% |
|
34 |
+
| **Godel-Prover-SFT** | **25600** | **64.7%** |
|
35 |
+
|
36 |
+
</div>
|
37 |
+
<div align="center">
|
38 |
+
MultiDataset
|
39 |
+
|
40 |
+
| Model |miniF2F| ProofNet | Lean-workbook |Our Held-out |
|
41 |
+
|------------------------|------------------|------------------|------------------|------------------|
|
42 |
+
|
43 |
+
</div>
|
44 |
+
|
45 |
+
<div align="center">
|
46 |
+
Putnam
|
47 |
+
|
48 |
+
| Rank |Type| Num-solved | Compute (Pass) |
|
49 |
+
|------------------------|------------------|------------------|------------------|
|
50 |
+
|
51 |
+
</div>
|