File size: 5,223 Bytes
f569aa3 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 |
---
license: apache-2.0
---
# Eurus-2-7B-PRIME
## Links
- ๐ [Blog]()
- ๐ค [PRIME Collection](https://huggingface.co./PRIME-RL)
- ๐ค [RL Data]()
## Introduction
Eurus-2-7B-PRIME is trained using **PRIME** (**P**rocess **R**einforcement through **IM**plicit r**E**ward) method, which effectively incorporates and updates reward models in reinforcement learning. It starts with [Eurus-2-7B-SFT](https://huggingface.co./PRIME-RL/Eurus-2-7B-SFT) and trains on [Eurus-2-RL-Data]().
<img src="./figures/prm.gif" alt="prm" style="zoom: 33%;" />
As shown in the animation above, in PRIME, the policy model and PRM are both initialized with the SFT model. For each RL iteration, the policy model first generates rollouts. Then, the [implicit PRM](https://arxiv.org/abs/2412.01981) and outcome verifier score the rollouts, and the implicit PRM get updated on the rollouts with outcome reward. Finally, the outcome reward \\(r_o\\) and process reward \\(r_p\\) are combined and used to update the policy model.
The PRIME implementation pseudocode is as follows:
<img src="./figures/prime-algo.jpg" alt="prime-algo" style="zoom: 33%;" />
The algorithm flow includes:
1. **Prompt filtering** based on policy model performance, only preserving those on which the policy model \\(\pi_\theta\\) achieves a accuracy between 0.2 and 0.8.
2. **Calculate implicit process reward** \\(r^t\\).
3. **Update Implicit PRM** \\(\pi_\psi\\) based on predicted implicit process reward \\(r^t\\) and ground truth outcome label \\(r\\).
4. **Advantage estimation with RLOO.** Specifically, we first calculate the return of outcome rewards and implicit process rewards separately:
- For ground truth outcome rewards, we directly adopt RLOO without any modification.
- For implicit process rewards, we perform a three-step process to calculate return: (1) Use the averaged implicit process rewards to calculate the leave-one-out baseline (2) Normalize the process reward at step \\(t\\) by subtracting the baseline; (3) Calculate the discounted return for each response.
Finally, advantage is set to the combination of both returns.
โ 5. **Update the policy** \\(\pi_\theta\\) using PPO loss for legit importance sampling.
## Usage
We apply tailored prompts for coding and math task:
**System Prompt**
```
\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n
```
**Coding**
```
{question} + "\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end.
```
**Math**
```
{question} + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
```
## Evaluation
Through PRIME, we successfully achieved substantial improvement on key reasoning benchmarks compared with the SFT model, leading to over **14.7%** improvement on average, over **20%** on AMC&AIME competitions.
The final results are presented below:
| | **Eurus-2-7B-PRIME** | Epoch2-272step | **Eurus-2-7B-SFT** | **Qwen-2.5-Math-7B-Instruct** | **Llama-3.1-70B-Instruct** | **GPT-4o** |
| ------------- | -------------------- | -------------- | ------------------ | ----------------------------- | -------------------------- | ---------- |
| AIME 2024 | **23.3 (+20.0)** | 26.7 | 3.3 | 13.3 | 16.7 | 9.3 |
| MATH-500 | 77.2 (+12.1) | 79.2 | 65.1 | **79.8** | 64.6 | 76.4 |
| AMC | **55.4 (+25.3)** | 57.8 | 30.1 | 50.6 | 30.1 | 45.8 |
| Minerva Math | **39.3 (+6.6)** | 38.6 | 32.7 | 34.6 | 35.3 | 36.8 |
| OlympiadBench | 39.3 (+9.5) | 42.1 | 29.8 | 40.7 | 31.9 | **43.3** |
| Avg. | **46.9 (+14.7)** | 48.9 | 32.2 | 43.8 | 36.4 | 43.3 |
![image-20241230162026156](./figures/performance.jpg)
We achieved this with only 1/10 data and model resources compared with Qwen-Math.
| | **Eurus-2-7B-PRIME** | **Qwen2.5-Math-7B-Instruct** |
| ---------- | ---------------------------------- | ------------------------------- |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
| SFT Data | **230K (open-source)** | 2.5M (open-source and in-house) |
| RM Data | **0** | 618K (in-house) |
| RM | **Eurus-2-7B-SFT** | Qwen2.5-Math-RM (72B) |
| RL Data | **80K queries \\(\times\\)4 samples** | 66K queries \\(\times\\) 32 samples |
## Citation
```
``` |