Safetensors
llama
linyongver commited on
Commit
fad47d4
·
verified ·
1 Parent(s): 5e1e7cd

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +51 -3
README.md CHANGED
@@ -1,3 +1,51 @@
1
- ---
2
- license: mit
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ ---
4
+ # Godel-Prover: Pushing the Limit of Automated Theorem Proving Through Large Scale Data Synthesizing
5
+
6
+ ## 1. Introduction
7
+
8
+ We introduce Godel-Prover-SFT.
9
+
10
+ <p align="center">
11
+ <img width="100%" src="figures/performance.png">
12
+ </p>
13
+
14
+ (Left) The performance of Pass@32 for full proof generation on miniF2F. Due to limited compute, we compare with DeepSeek-Prover-v1.5 on the Pass@32 metric (Table 1 of Xin et.al., ), which is different from Pass@32\*6400 in Fig. 1 of Xin et.al., The Pass@N metric indicates that we generate N proofs for a single problem; if any one of these N proofs successfully solves the problem, it is considered solved. (Middle) This sub-figure presents a comparison of Godel-Prover-SFT and Deepseek-Prover-v1.5 in terms of miniF2F performance across different inference budgets, ranging from Pass@32, 64, 128, ..., 4\*6400. (Right) The number of problems solved in Lean-workbook by Godel-Prover-SFT compared to existing works. InternLM2.5-Step-Prover and InternLM-Math-Plus collectively solve and open-source 16K samples, while we solve and open-source 29.7K samples. For a more detailed discussion on the comparison between Godel-Prover-SFT, Deepseek-Prover-v1.5-RL, and InternLM2.5-Step-Prover, please refer to Appendix of our paper.
15
+
16
+ ## 2. Evaluation Results
17
+
18
+ <div align="center">
19
+
20
+ | Model |Compute (Pass)| miniF2F-test |
21
+ |------------------------|------------------|------------------|
22
+ | TheoremLamma | 128 | 33.6% |
23
+ | DeepSeek-Prover-V1 | 32 | 46.1% |
24
+ | DeepSeek-Prover-V1.5-SFT | 32 | 48.2% |
25
+ | DeepSeek-Prover-V1.5-RL | 32 | 50.0% |
26
+ | **Godel-Prover-SFT** | **32** | **57.6%** |
27
+ |------------------------|------------------|------------------|
28
+ | DeepSeek-Prover-V1.5-SFT | 3200 | 53.3% |
29
+ | DeepSeek-Prover-V1.5-RL | 3200 | 54.9% |
30
+ | **Godel-Prover-SFT** | **3200** | **62.7%** |
31
+ |------------------------|------------------|------------------|
32
+ | DeepSeek-Prover-V1.5-SFT | 25600 | 55.8% |
33
+ | DeepSeek-Prover-V1.5-RL | 25600 | 58.5% |
34
+ | **Godel-Prover-SFT** | **25600** | **64.7%** |
35
+
36
+ </div>
37
+ <div align="center">
38
+ MultiDataset
39
+
40
+ | Model |miniF2F| ProofNet | Lean-workbook |Our Held-out |
41
+ |------------------------|------------------|------------------|------------------|------------------|
42
+
43
+ </div>
44
+
45
+ <div align="center">
46
+ Putnam
47
+
48
+ | Rank |Type| Num-solved | Compute (Pass) |
49
+ |------------------------|------------------|------------------|------------------|
50
+
51
+ </div>