[Experiment] Training R1-Zero-like models with Open R1
Context
There are several recent research papers which explore various aspects of R1-Zero-like training on open base models like Qwen2.5-7B and Llama-3.1-8B:
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
- Understanding R1-Zero-Like Training: A Critical Perspective
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
These papers focus on mathematical reasoning (which is easy to verify) and do not always agree on the key factors needed for R1-Zero-like training. Since TRL now scales to large models, it's time to train R1-Zero-like models with Open R1!
Main goal: reproduce / improve the performance of the DeepSeek-R1-Zero-Qwen-32B model that DeepSeek trained in the R1 tech report:
Although DeepSeek found that pure RL performed worse than simple SFT distillation, the DAPO paper shows that by tweaking the GRPO training process, one can actually surpass the distilled model (at least on math):
With that in mind, we will explore which subset of ideas in the above papers are sufficient to achieve comparable performance, starting first in math, then code and STEM.
We'll use this post and comments to track progress towards this goal - ideas and suggestions are more than welcome!
Setup
- Models: Qwen2.5-7B for ablations and Qwen2.5-32B for final runs
- Datasets: SynthLabsAI/Big-Math-RL-Verified and BytedTsinghua-SIA/DAPO-Math-17k for math. Code and other domains to be decided.
Links
- Code: I'll be running experiments from this draft PR of `open-r1`: https://github.com/huggingface/open-r1/pull/569
- Experiment logs: https://api.wandb.ai/links/huggingface/179dbkli
- Models and datasets: https://huggingface.co./collections/open-r1/open-r1-zero-67eba6a037505bbcb5157d07
Experiments to run
- Train a baseline using "standard" parameters on Big-Math and DAPO-Math-17k to compare relative performance & learning dynamics
- Measure effect on convergence with μ=2,4 (default is 1 in TRL)
- Disable KL term with β=0
- Clip higher with ε_low=0.2 and ε_high=0.28 (DAPO values)
- Add soft overlong reward function from DAPO paper
- Add overlong filter (mask the loss of truncated completions)
- DAPO (default) vs Dr. GRPO loss
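To make the ablations above concrete, here is a rough sketch of how these knobs map onto TRL's `GRPOConfig`. Treat the argument names as assumptions: `epsilon_high`, `mask_truncated_completions` and `loss_type` may only be available on `trl@main` or in the draft PR, and the values below are just the DAPO settings listed above.

```python
# Hypothetical ablation config: a minimal sketch, not the exact settings used in the runs.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2.5-7B-Open-R1-Zero",
    num_iterations=2,                 # μ: optimisation steps per generation batch (TRL default is 1)
    beta=0.0,                         # β=0 disables the KL term
    epsilon=0.2,                      # ε_low
    epsilon_high=0.28,                # ε_high ("clip higher", DAPO values)
    mask_truncated_completions=True,  # overlong filter: mask the loss of truncated completions
    loss_type="dr_grpo",              # swap the default loss for the Dr. GRPO variant
)
```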
Features to add to TRL
- Overlong filter could be exposed as an arg like `mask_truncated_completions` in `GRPOConfig`
- Add logging to measure average stopped length and clip ratio (SimpleRL-Zoo). Done: https://github.com/huggingface/trl/pull/3188
Features to add to Open R1
Logbook [1.4.2025]
Experiments
- Focused on training a baseline with `Qwen2.5-7B` and discovered a serious bug in the accuracy reward function of `open-r1` 🙀. First, the parser was failing on non-LaTeX ground truth answers like `"6"`, and second we were assigning a default reward of 1 when the ground truth could not be parsed. Fixed here: https://github.com/huggingface/open-r1/pull/566 (a sketch of the corrected logic follows this list).
- I am running 3 baseline experiments to gauge stability on `SynthLabsAI/Big-Math-RL-Verified`:
  - v00.0X: train on everything
  - v01.0X: train on "medium" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
  - v02.0X: train on "hard" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
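For reference, a minimal sketch of the corrected accuracy reward is below. This is hypothetical code assuming the `math_verify` `parse`/`verify` API (with `LatexExtractionConfig` and `ExprExtractionConfig` so plain answers like `"6"` are handled); the actual patch in #566 may differ, including how unparseable ground truths are skipped.

```python
# Hypothetical accuracy reward: parse the gold answer with both LaTeX and plain-expression
# extractors, and skip samples whose gold answer still cannot be parsed instead of
# assigning them a default reward of 1.
from math_verify import ExprExtractionConfig, LatexExtractionConfig, parse, verify


def accuracy_reward(completion: str, solution: str) -> float | None:
    extraction_config = [LatexExtractionConfig(), ExprExtractionConfig()]
    gold = parse(solution, extraction_config=extraction_config)
    if not gold:
        # Previously this branch effectively returned 1.0, rewarding unparseable ground truths.
        return None  # one possible convention for "skip this sample"
    answer = parse(completion, extraction_config=extraction_config)
    return 1.0 if verify(gold, answer) else 0.0
```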
Overall, training looks fairly stable, with accuracy rewards and completion lengths going up. The format reward is currently weighted at 0.2 and might need bumping up if the model cannot get enough signal to learn it. Note that I am using a chat template to define the DeepSeek-R1 prompt:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e.,
<think>
reasoning process here
</think>
<answer>
answer here
</answer>.
User: Given that the positive real numbers a and b satisfy a + b = 1, find the maximum value of sqrt(a) + sqrt(b).
Assistant:
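In case it helps others, here is a rough sketch of wiring that prompt in as a Jinja chat template (assumed code; the actual template in the PR likely handles system messages and `add_generation_prompt` more carefully):

```python
# Hypothetical chat template reproducing the R1-Zero prompt format above.
from transformers import AutoTokenizer

R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with "
    "the answer. The reasoning process and answer are enclosed within <think>...</think> and "
    "<answer>...</answer> tags, respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>."
    "{% for message in messages %}{% if message['role'] == 'user' %}\nUser: {{ message['content'] }}{% endif %}{% endfor %}"
    "\nAssistant:"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
tokenizer.chat_template = R1_ZERO_TEMPLATE

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Given that the positive real numbers a and b satisfy a + b = 1, find the maximum value of sqrt(a) + sqrt(b)."}],
    tokenize=False,
)
print(prompt)
```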
As many other papers have observed, `Qwen2.5-7B` is remarkably good at following instructions with little prompting and is able to emit the `\boxed{}` format fairly consistently without any reference to this in the prompt!
TRL / Open R1 updates
- @edbeeching has added the new completion metrics here: https://github.com/huggingface/trl/pull/3188
- @ShirinYamani has added the soft overlong reward function: https://github.com/huggingface/open-r1/pull/567
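For reference, DAPO's soft overlong punishment looks roughly like this (a sketch; the function name, defaults, and how the open-r1 PR plugs it into the reward pipeline are assumptions):

```python
# Sketch of DAPO's length-aware penalty: zero inside the soft zone, a linear penalty over
# the final `cache_len` tokens, and -1 once the maximum length is exceeded. DAPO's reported
# setup is (roughly) a 16k expected length plus a 4k soft-punish buffer.
def soft_overlong_punishment(completion_len: int, max_len: int = 20480, cache_len: int = 4096) -> float:
    if completion_len <= max_len - cache_len:
        return 0.0
    if completion_len <= max_len:
        # Interpolates linearly from 0 down to -1 across the buffer.
        return ((max_len - cache_len) - completion_len) / cache_len
    return -1.0
```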
Next
- Preprocess the BigMath dataset to filter out any answers that cannot be parsed / verified
- Rebase on `trl@main` and re-run baseline to measure stability.
- Gather downstream evals with `pass@1` metric from `lighteval`: https://github.com/huggingface/lighteval/pull/647
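On `pass@1`: with n samples per problem and c correct, the standard unbiased pass@k estimator from the Codex paper reduces to c/n for k=1. A small sketch is below (whether `lighteval`'s implementation in the linked PR matches this exactly is an assumption):

```python
# Unbiased pass@k estimator (Chen et al., "Evaluating Large Language Models Trained on Code").
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: correct samples, k: attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=16, c=9, k=1))  # 0.5625 == 9/16
```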