[Experiment] Training R1-Zero-like models with Open R1

#20
by lewtun - opened

Context

There are several recent research papers which explore various aspects of R1-Zero-like training on open base models like Qwen2.5-7B and Llama-3.1-8B:

These papers focus on mathematical reasoning (easy to verify) and do not always agree on the key factors needed for R1-Zero-like training. Since TRL now scales to large models, it's a good time to train R1-Zero-like models with Open R1!

Main goal: reproduce or improve on the performance of the DeepSeek-R1-Zero-Qwen-32B model reported in DeepSeek's R1 tech report:

[Screenshot: DeepSeek-R1-Zero-Qwen-32B results table from the R1 tech report]

Although DeepSeek found that pure RL performed worse than simple SFT distillation, the DAPO paper shows that by tweaking the GRPO training process, one can actually surpass the distilled model (at least on math):

[Screenshot: DAPO results surpassing the distilled model on math benchmarks]

With that in mind, we will explore which subset of the ideas in the above papers is sufficient to achieve comparable performance, starting with math, then moving to code and STEM.

We'll use this post and comments to track progress towards this goal - ideas and suggestions are more than welcome!

Setup

Links

Experiments to run

  1. Train a baseline using "standard" parameters on Big-Math and DAPO-Math-17k to compare relative performance & learning dynamics
  2. Measure effect on convergence with μ=2,4 (default is 1 in TRL)
  3. Disable KL term with β=0
  4. Clip higher with ε_low=0.2 and ε_high=0.28 (DAPO values)
  5. Add soft overlong reward function from DAPO paper
  6. Add overlong filter (mask the loss of truncated completions)
  7. DAPO (default) vs. Dr. GRPO loss (see the config and reward sketches after this list)
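To make these ablations concrete, here is a minimal sketch of how they might map onto TRL's GRPOConfig. The argument names (num_iterations for μ, beta, epsilon, epsilon_high, loss_type) reflect my reading of recent TRL releases and should be treated as assumptions; double-check against the installed version before copying.

```python
# Hedged sketch: mapping experiments 2-4 and 7 onto TRL's GRPOTrainer config.
# Argument names/values are assumptions based on recent TRL versions.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="qwen2.5-7b-grpo-baseline",
    num_iterations=2,       # μ: optimisation steps per generation batch (experiment 2)
    beta=0.0,               # disable the KL penalty (experiment 3)
    epsilon=0.2,            # ε_low: lower clipping bound (experiment 4)
    epsilon_high=0.28,      # ε_high: DAPO "clip higher" value (experiment 4)
    loss_type="dr_grpo",    # swap in the Dr. GRPO loss for experiment 7
    max_completion_length=4096,
)
```

Experiment 5's soft overlong reward follows the DAPO paper's piecewise definition. The sketch below works on raw token lengths so it stays independent of any particular trainer interface; the length thresholds are placeholders.

```python
def soft_overlong_reward(completion_lengths, max_len=4096, cache_len=512):
    """DAPO-style soft overlong punishment: 0 inside the soft limit, linearly
    decreasing to -1 across the last `cache_len` tokens, and -1 beyond `max_len`."""
    rewards = []
    for length in completion_lengths:
        if length <= max_len - cache_len:
            rewards.append(0.0)
        elif length <= max_len:
            rewards.append((max_len - cache_len - length) / cache_len)
        else:
            rewards.append(-1.0)
    return rewards
```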

Features to add to TRL

  1. Overlong filter could be exposed as an arg like mask_truncated_completions in GRPOConfig (see the masking sketch after this list)
  2. Add logging to measure average stopped length and clip ratio (SimpleRL-Zoo). Done: https://github.com/huggingface/trl/pull/3188
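As a rough illustration of the overlong filter in item 1, the idea is simply to zero out the loss contribution of completions that hit the length limit, so their unreliable rewards don't push the gradient. This is a sketch of the masking logic under assumed tensor shapes, not TRL's actual implementation:

```python
import torch

def mask_truncated_completions_loss(per_token_loss, completion_mask, is_truncated):
    """Drop truncated completions from the loss average.

    per_token_loss:  (batch, seq_len) float tensor of per-token GRPO losses
    completion_mask: (batch, seq_len) 0/1 tensor marking completion tokens
    is_truncated:    (batch,) bool tensor, True if the completion hit max length
    """
    keep = (~is_truncated).float().unsqueeze(-1)   # (batch, 1)
    mask = completion_mask * keep                  # zero out truncated rows
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```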

Features to add to Open R1

  1. Add logging for pass@k accuracy (SimpleRL-Zero); see the estimator sketch after this list
  2. Add reasoning behaviours callback with LLM APIs to track backtracking and other behaviours during training (SimpleRL-Zero)
    [Screenshot: SimpleRL-Zero-style plot of reasoning behaviours (e.g. backtracking) tracked over training]
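For the pass@k logging in item 1, the standard unbiased estimator from the Codex paper is straightforward to compute from n sampled completions of which c are correct; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples per problem, 4 correct, estimating pass@8
print(pass_at_k(n=16, c=4, k=8))  # ≈ 0.96
```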

Logbook [1.4.2025]

Experiments

  • Focused on training a baseline with Qwen2.5-7B and discovered a serious bug in the accuracy reward function of open-r1 🙀. First, the parser was failing on non-LaTeX ground truth answers like "6", and second, we were assigning a default reward of 1 when the ground truth could not be parsed. Fixed here: https://github.com/huggingface/open-r1/pull/566 (a sketch of the corrected logic follows the chart below)

[W&B chart, 1.4.2025]
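For reference, here is a hedged sketch of the corrected behaviour using math_verify, roughly following open-r1's reward-function conventions; the exact signature and whether an unparseable gold answer should yield 0, None, or be filtered upstream are design choices here, not a copy of the PR:

```python
from math_verify import ExprExtractionConfig, LatexExtractionConfig, parse, verify

def accuracy_reward(completions, solution, **kwargs):
    """Reward 1.0 when the completion's final answer matches the ground truth, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    extraction_config = [LatexExtractionConfig(), ExprExtractionConfig()]
    rewards = []
    for content, sol in zip(contents, solution):
        # Parse the gold answer with both LaTeX and plain-expression extractors,
        # so non-LaTeX answers like "6" are handled.
        gold = parse(sol, extraction_config=extraction_config)
        if len(gold) == 0:
            rewards.append(0.0)  # unparseable gold answer -> no default reward of 1
            continue
        answer = parse(content, extraction_config=extraction_config)
        rewards.append(1.0 if verify(gold, answer) else 0.0)
    return rewards
```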

  • I am running 3 baseline experiments to gauge stability on SynthLabsAI/Big-Math-RL-Verified (a filtering sketch follows below):
    • v00.0X: train on everything
    • v01.0X: train on "medium" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
    • v02.0X: train on "hard" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates

[Screenshot: accuracy reward and completion length curves for the three baseline runs]
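For the record, a sketch of the percentile-based difficulty split; the solve-rate column name (llama8b_solve_rate) and the 25th/75th-percentile cut points are assumptions about the Big-Math schema and my thresholds, not the exact recipe:

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")

# NOTE: the column name is an assumption; adjust to the dataset's actual schema.
solve_rates = np.array(ds["llama8b_solve_rate"], dtype=float)
p25, p75 = np.nanpercentile(solve_rates, [25, 75])

def in_band(rate, low, high):
    return rate is not None and low <= rate <= high

medium = ds.filter(lambda x: in_band(x["llama8b_solve_rate"], p25, p75))  # "medium" split
hard = ds.filter(lambda x: in_band(x["llama8b_solve_rate"], 0.0, p25))    # "hard" split (low solve rate)
```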

Overall, training looks fairly stable, with accuracy rewards and completion lengths going up. The format reward is currently weighted with 0.2 and might need bumping up if the model cannot get enough signal to learn it (a sketch of this check follows the prompt below). Note that I am using a chat template to define the DeepSeek-R1 prompt:

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., 
<think>
reasoning process here
</think>
<answer>
answer here
</answer>.

User: Given that the positive real numbers a and b satisfy a + b = 1, find the maximum value of sqrt(a) + sqrt(b).

Assistant: 
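For context, the format reward mentioned above is essentially a regex check that the completion follows the <think>/<answer> structure defined in this prompt. The sketch below is one way to write it; open-r1's actual implementation may differ in details such as whitespace handling:

```python
import re

# Require the completion to be exactly <think>...</think> followed by <answer>...</answer>.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completions, **kwargs):
    """1.0 if the completion matches the expected tag structure, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if FORMAT_PATTERN.match(c.strip()) else 0.0 for c in contents]
```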

As many other papers have observed, Qwen2.5-7B is remarkably good at following instructions with little prompting and is able to emit the \boxed{} format fairly consistently without any reference to it in the prompt!

TRL / Open R1 updates

Next

  • Preprocess the BigMath dataset to filter out any answers that cannot be parsed / verified (see the sketch below)
  • Rebase on trl@main and re-run baseline to measure stability.
  • Gather downstream evals with pass@1 metric from lighteval: https://github.com/huggingface/lighteval/pull/647
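For the first item, a sketch of the preprocessing step using math_verify, with the same caveat as above that the gold-answer column name ("answer") is an assumption about the Big-Math schema:

```python
from datasets import load_dataset
from math_verify import ExprExtractionConfig, LatexExtractionConfig, parse

ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")

def has_parseable_answer(example):
    """Keep only rows whose gold answer math_verify can parse (and later verify)."""
    parsed = parse(
        example["answer"],  # assumed column name
        extraction_config=[LatexExtractionConfig(), ExprExtractionConfig()],
    )
    return len(parsed) > 0

ds_clean = ds.filter(has_parseable_answer)
```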