[Experiment] Applying GRPO to DeepSeek-R1-Distill-Qwen-1.5B with LIMO
In the DeepSeek-R1 tech report there is a hidden gem of advice about applying RL to their SFT-distilled models: namely, that running a further stage of RL on top of the distilled checkpoints could yield significant additional gains.
I've started running some experiments to test the impact that GRPO can have on the existing distilled models. To keep things simple, I'm using the DeepSeek-R1-Distill-Qwen-1.5B model and the small yet high-quality LIMO dataset to iterate faster. I'll be using this discussion to track my progress, but feel free to chime in if you have ideas on how to improve the training!
Links
- Weights & Biases report with training metrics: https://api.wandb.ai/links/huggingface/3l5cglav
- My branch in `open-r1`: https://github.com/huggingface/open-r1/tree/grpo-limo
- Leaderboard to track downstream evals (search with `limo` to get the models): https://huggingface.co./spaces/open-r1/open-r1-eval-leaderboard
Experimental setup
Baseline parameters from v00.00 run:
- LR = 1e-6
- Number of tokens: 4096
- Number of generations: 7
- 3 epochs
- Effective batch size: 56 (8 per device, grad acc steps 1, 7 H100s for training; one extra GPU runs vLLM generation). Per-device batch size and grad acc steps tuned to (4, 2) for max tokens 8192 and (2, 4) for 16384
Ablations over:
- Number of tokens generated (v00.0X runs): 4096, 8192, 16384
- Learning rate (v01.0X runs): 2e-6, 4e-6, 8e-6
- Number of generations (v02.0X runs): 14, 28, 56
- Optimizer (v03.0X runs): Paged Adam8Bit
Note: there is a bug in the format rewards (https://github.com/huggingface/open-r1/issues/237), so we should re-run the best params again once this is fixed.
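For concreteness, the baseline roughly maps onto a TRL `GRPOConfig`/`GRPOTrainer` setup like the sketch below. This is a minimal sketch rather than the exact launch config: argument names can differ between TRL versions, the LIMO column name is assumed from the dataset card, and the reward-function import path depends on the open-r1 revision.

```python
# Minimal sketch of the baseline run, assuming a recent TRL release with GRPO support.
# Argument names may differ slightly between TRL versions; treat the LIMO column name
# and the reward-function import path as assumptions.
from datasets import load_dataset
from open_r1.grpo import accuracy_reward, format_reward  # location depends on the open-r1 revision
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("GAIR/LIMO", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})  # GRPOTrainer expects a "prompt" column

training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-1.5B-GRPO-LIMO",
    learning_rate=1e-6,              # baseline LR (ablated up to 8e-6)
    num_train_epochs=3,
    max_completion_length=4096,      # ablated: 4096 / 8192 / 16384
    num_generations=7,               # ablated: 7 / 14 / 28 / 56
    per_device_train_batch_size=8,   # 7 training GPUs -> effective batch size 56
    gradient_accumulation_steps=1,
    bf16=True,
    use_vllm=True,                   # the remaining GPU is reserved for vLLM generation
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=[accuracy_reward, format_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```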
Key takeaways so far
- It really works! Depending on the hyperparameters, I'm able to get a ~10 point boost on AIME24 and GPQA, and a ~3 point boost on MATH-500 (likely saturated).
- Generating more tokens gives larger rewards and a better loss.
- Larger learning rates give larger rewards, but produce a "bump and dip" around 100 steps. The KL is also much larger.
- Increasing the number of generations gives larger rewards, but also seems to induce more spikes in the loss/KL for some reason (maybe a bug in TRL?). The smoothest run appears to be N=14.
- The accuracy reward is rather flat, perhaps suggesting we are not generating enough tokens to emit the required `\boxed{}` answer.
- Using 8-bit Paged AdamW doesn't seem to noticeably affect the training dynamics vs 32-bit AdamW (apart from the KL being somewhat larger). This is great for memory!
More to come as I run more experiments 🤗
Do you have any insight into custom reward function design? That is where my versions of GRPO are failing.
> Do you have any insight into custom reward function design? That is where my versions of GRPO are failing.
I am currently using the full set of reward functions we have in open-r1: https://github.com/huggingface/open-r1/blob/main/src/open_r1/grpo.py
I think the current biggest drivers of performance are the accuracy reward and the reasoning-steps reward.
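In case it helps with your custom rewards: with TRL's `GRPOTrainer`, a reward function is just a plain Python callable that receives the sampled completions (plus any extra dataset columns as keyword arguments) and returns one float per completion. Below is a simplified sketch in the spirit of the open-r1 rewards; it is not the exact implementation, which uses the `math_verify` library for robust LaTeX answer checking.

```python
import re

# Simplified sketches in the spirit of the open-r1 rewards -- NOT the exact
# implementations, which use the math-verify library for robust LaTeX checking.

def format_reward(completions, **kwargs):
    """1.0 if a completion matches the <think>...</think><answer>...</answer> format, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, content, re.DOTALL) else 0.0 for content in contents]

def accuracy_reward(completions, solution, **kwargs):
    """1.0 if the final \\boxed{...} answer string-matches the reference solution, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        match = re.search(r"\\boxed\{(.+?)\}", content)
        rewards.append(1.0 if match and match.group(1).strip() == str(sol).strip() else 0.0)
    return rewards
```

The exact-string match on `\boxed{}` above is deliberately naive; the real accuracy reward parses and verifies the answer with `math_verify`, which is far more tolerant of formatting differences.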
Awesome! It would be nice to also report the batch size and gradient_accumulation_steps (I couldn't find these reported anywhere, maybe I missed it). I would also be curious to know how sensitive the performance is to the beta parameter (the weight of the KL term); it looks like it's set to 0.04 in all your experiments. If you look at the plots for the loss and the KL term, they look very similar (up to the factor beta=0.04), which seems to indicate that the KL term dominates the loss and maybe you don't give enough weight to the reward function.
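For reference, in the GRPO objective from the DeepSeekMath paper (written here at the sequence level and slightly simplified) the KL penalty enters with weight $\beta$ against the clipped policy-gradient term, so with mostly flat rewards and $\beta = 0.04$ the KL contribution can indeed end up dominating the loss:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}A_i,\ \mathrm{clip}\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\epsilon,\,1+\epsilon\right)A_i\right)-\beta\,\mathbb{D}_{\mathrm{KL}}\left[\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\right]\right)\right]$$

where $A_i = \dfrac{r_i-\mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}$ is the group-normalized advantage.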
AIME 2025 questions have just been released; maybe you can use them as an additional evaluation set as well.
Great work. You don't get reasoning with such a small model, do you? And can this be trained on a GPU-poor setup (e.g. with an Unsloth model)?
I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?
Thanks for the wonderful effort on reproducing R1. I was just curious whether you tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve on?
> I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?
Good catch, indeed it should be 56! Fixed now :)
> Thanks for the wonderful effort on reproducing R1. I was just curious whether you tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
> I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve on?
We haven't done too many experiments with the base models yet, as the DeepSeek-R1 tech report shows that distillation outperforms pure RL.
This suggests that the recipe for producing strong models is as follows:
- Create synthetic data from R1 on various domains of interest
- Run SFT with a smaller model
- Apply GRPO to squeeze out additional performance
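A rough sketch of what steps 1 and 2 could look like in code is below. It is purely illustrative: the model/dataset names, prompts and hyperparameters are placeholders rather than the exact open-r1 recipe, and step 3 is the GRPO setup shown at the top of the thread.

```python
# Purely illustrative sketch of steps 1-2: model/dataset names, prompts and
# hyperparameters are placeholders, not the exact open-r1 recipe.
from datasets import Dataset, load_dataset
from vllm import LLM, SamplingParams
from trl import SFTConfig, SFTTrainer

# --- Step 1: generate synthetic reasoning traces from an R1-style teacher ---
prompts = load_dataset("GAIR/LIMO", split="train")["question"]  # any domain-specific prompts
teacher = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")  # stand-in for the R1 teacher
outputs = teacher.generate(prompts, SamplingParams(temperature=0.6, max_tokens=8192))
traces = Dataset.from_dict({
    "prompt": prompts,
    "completion": [o.outputs[0].text for o in outputs],
})

# --- Step 2: SFT a small student on the distilled traces ---
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=SFTConfig(output_dir="qwen-1.5b-distilled", num_train_epochs=1, bf16=True),
    train_dataset=traces,
)
trainer.train()

# --- Step 3: apply GRPO on top (see the GRPOTrainer sketch earlier in the thread) ---
```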
@lewtun Have you tried running something like this for AIME24?
```shell
lighteval vllm \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,dtype=bfloat16,max_model_length=32768" \
    "custom|aime24|0|0" \
    --custom-tasks open-r1/src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>"
```
This was the result I got, which helps me believe the 55.5 reported in `DeepSeek-R1-Distill-Qwen-7B`'s model card; possibly they used a better system prompt?
| Task | Version | Metric | Value | | Stderr |
|---------------|------:|----------------|----:|---|-----:|
| all | | extractive_match | 0.5 | ± | 0.0928 |
| custom:aime24:0 | 1 | extractive_match | 0.5 | ± | 0.0928 |
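As a quick sanity check on the stderr column: with the 30 AIME24 problems and 0.5 accuracy, the usual binomial standard error with n−1 in the denominator (which I assume is what lighteval reports) gives exactly this value:

```python
import math

# Binomial standard error for 15/30 correct on AIME24, with n - 1 in the denominator.
p, n = 0.5, 30
print(round(math.sqrt(p * (1 - p) / (n - 1)), 4))  # 0.0928, matching the table above
```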
What does LIMO refer to, the GAIR/LIMO dataset? I got confused! Where can I find a description of the RL training data? Thanks!
Hi @lewtun, I tried open-r1 on Qwen-2.5B Math and could also see some improvement. I'd like to share some of these results:
- I used slightly modified versions of open-r1 and trl. I detailed what was modified, but it was not much: I just adjusted a bit of the prompt and the format function. For trl, I put in the PPO-style clamping and made small modifications to use FSDP.
- I used 14 generations with grad accum = 2 and 7 devices for tuning, 1 for vLLM.
The plot below shows the evolution of the scores compared to the baseline (we take an average of 3 for all):
- We can see that there was quite a quick improvement in scores, starting from 100 steps.
- For MATH-500, it further improved as training went on.
Looking at the training curves:
- You can see that the accuracy reward jumped up quite fast, which may explain how we got good performance at 100 steps.
- The accuracy continued to rise, and at some point the MATH-500 score improved further.
Some other notes:
- I tried to use base Qwen 2.5, and the results were not good. It seems that some extended pretraining on the domain is required.
- I did not spend a lot of time tuning the hyperparameters. I did notice an improvement in the quality of responses, e.g. the model tries to verify the answer, but it does not produce very long reasoning traces like those of R1.
cc: @rganti
@mirinflim Is your open-r1 branch from before the open-r1 team found and adjusted the chat template?
@mirinflim Yes, that one, though I don't know what difference training with FSDP vs ZeRO-3 would make. Is it better to use FSDP on a single node? Also, as for native bf16, is the current one used by open_r1 different? Have you had any luck training in FP8?