[Experiment] Applying GRPO to DeepSeek-R1-Distill-Qwen-1.5B with LIMO
In the DeepSeek-R1 tech report there is a hidden gem of advice about applying RL to their SFT-distilled models: namely, that running a further stage of RL on top of the distilled checkpoints could yield significant additional gains.
I've started running some experiments to test the impact that GRPO can have on the existing distilled models. To keep things simple, I'm using the DeepSeek-R1-Distill-Qwen-1.5B model and the small yet high-quality LIMO dataset to iterate faster. I'll be using this discussion to track my progress, but feel free to chime in if you have ideas on how to improve the training!
Links
- Weights & Biases report with training metrics: https://api.wandb.ai/links/huggingface/3l5cglav
- My branch in `open-r1`: https://github.com/huggingface/open-r1/tree/grpo-limo
- Leaderboard to track downstream evals (search with `limo` to get the models): https://huggingface.co./spaces/open-r1/open-r1-eval-leaderboard
Experimental setup
Baseline parameters from v00.00 run:
- LR = 1e-6
- Number of tokens: 4096
- Number of generations: 7
- 3 epochs
- Effective batch size: 56 (8 per device, grad acc steps 1, 7 H100s for training; one extra GPU runs vLLM generation). Per-device batch size and grad acc steps tuned to (4, 2) for max tokens 8192 and (2, 4) for 16384
Ablations over:
- Number of tokens generated (v00.0X runs): 4096, 8192, 16384
- Learning rate (v01.0X runs): 2e-6, 4e-6, 8e-6
- Number of generations (v02.0X runs): 14, 28, 56
- Optimizer (v03.0X runs): Paged Adam8Bit
Note: there is a bug in the format rewards (https://github.com/huggingface/open-r1/issues/237), so we should re-run the best params again once this is fixed.
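For concreteness, the baseline roughly maps onto a TRL `GRPOConfig`/`GRPOTrainer` setup like the sketch below. This is a minimal sketch rather than the exact launch config: argument names can differ between TRL versions, the LIMO column name is assumed from the dataset card, and the reward-function import path depends on the open-r1 revision.

```python
# Minimal sketch of the baseline run, assuming a recent TRL release with GRPO support.
# Argument names may differ slightly between TRL versions; treat the LIMO column name
# and the reward-function import path as assumptions.
from datasets import load_dataset
from open_r1.grpo import accuracy_reward, format_reward  # location depends on the open-r1 revision
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("GAIR/LIMO", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})  # GRPOTrainer expects a "prompt" column

training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-1.5B-GRPO-LIMO",
    learning_rate=1e-6,              # baseline LR (ablated up to 8e-6)
    num_train_epochs=3,
    max_completion_length=4096,      # ablated: 4096 / 8192 / 16384
    num_generations=7,               # ablated: 7 / 14 / 28 / 56
    per_device_train_batch_size=8,   # 7 training GPUs -> effective batch size 56
    gradient_accumulation_steps=1,
    bf16=True,
    use_vllm=True,                   # the remaining GPU is reserved for vLLM generation
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=[accuracy_reward, format_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```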
Key takeaways so far
- It really works! Depending on the hyperparameters, I'm able to get a ~10 point boost on AIME24 and GPQA, and a ~3 point boost on MATH-500 (likely saturated).
- Generating more tokens gives larger rewards and a better loss.
- Larger learning rates give larger rewards, but produce a "bump and dip" around 100 steps. The KL is also much larger.
- Increasing the number of generations gives larger rewards, but also seems to induce more spikes in the loss/KL for some reason (maybe a bug in TRL?). The smoothest run appears to be N=14.
- The accuracy reward is rather flat, perhaps suggesting we are not generating enough tokens to emit the required `\boxed{}` answer.
- Using 8-bit Paged AdamW doesn't seem to noticeably affect the training dynamics vs 32-bit AdamW (apart from the KL being somewhat larger). This is great for memory!
More to come as I run more experiments 🤗
Do you have any insight into custom reward function design? That is where my versions of GRPO are failing.
> Do you have any insight into custom reward function design? That is where my versions of GRPO are failing.
I am currently using the full set of reward functions we have in open-r1: https://github.com/huggingface/open-r1/blob/main/src/open_r1/grpo.py
I think the current biggest drivers of performance are the accuracy reward and the reasoning-steps reward.
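In case it helps with your custom rewards: with TRL's `GRPOTrainer`, a reward function is just a plain Python callable that receives the sampled completions (plus any extra dataset columns as keyword arguments) and returns one float per completion. Below is a simplified sketch in the spirit of the open-r1 rewards; it is not the exact implementation, which uses the `math_verify` library for robust LaTeX answer checking.

```python
import re

# Simplified sketches in the spirit of the open-r1 rewards -- NOT the exact
# implementations, which use the math-verify library for robust LaTeX checking.

def format_reward(completions, **kwargs):
    """1.0 if a completion matches the <think>...</think><answer>...</answer> format, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, content, re.DOTALL) else 0.0 for content in contents]

def accuracy_reward(completions, solution, **kwargs):
    """1.0 if the final \\boxed{...} answer string-matches the reference solution, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        match = re.search(r"\\boxed\{(.+?)\}", content)
        rewards.append(1.0 if match and match.group(1).strip() == str(sol).strip() else 0.0)
    return rewards
```

The exact-string match on `\boxed{}` above is deliberately naive; the real accuracy reward parses and verifies the answer with `math_verify`, which is far more tolerant of formatting differences.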
Awesome! It would be nice to also report the batch size and gradient_accumulation_steps (I couldn't find these reported anywhere, maybe I missed it). I would also be curious to know how sensitive the performance is to the beta parameter (the weight of the KL term); it looks like it's set to 0.04 in all your experiments. If you look at the plots for the loss and the KL term, they look very similar (up to the factor beta=0.04), which seems to indicate that the KL term dominates the loss and maybe you don't give enough weight to the reward function.
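For reference, in the GRPO objective from the DeepSeekMath paper (written here at the sequence level and slightly simplified) the KL penalty enters with weight $\beta$ against the clipped policy-gradient term, so with mostly flat rewards and $\beta = 0.04$ the KL contribution can indeed end up dominating the loss:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}A_i,\ \mathrm{clip}\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\epsilon,\,1+\epsilon\right)A_i\right)-\beta\,\mathbb{D}_{\mathrm{KL}}\left[\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\right]\right)\right]$$

where $A_i = \dfrac{r_i-\mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}$ is the group-normalized advantage.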
AIME 2025 questions have just been released; maybe you can use them as an additional evaluation set as well.
Great work. You don't get reasoning with such a small model, do you? And can this be trained on a GPU-poor setup (e.g. with an Unsloth model)?
I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?
Thanks for the wonderful effort on reproducing R1. I was just curious whether you tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve on?
> I wonder why the effective batch size is 64. Shouldn't it be 56, since you use 1 GPU for vLLM?
Good catch, indeed it should be 56! Fixed now :)
> Thanks for the wonderful effort on reproducing R1. I was just curious whether you tried using RL on top of one of the pretrained models (like Qwen2.5-1.5B) rather than the models distilled from R1.
> I was hoping to understand whether we really need the distilled models for simpler datasets like MATH/GSM-8K. In other words, is the capability of these pretrained models (1-3B scale) sufficient for RL to improve on?
We haven't done too many experiments with the base models yet, as the DeepSeek-R1 tech report shows that distillation outperforms pure RL.
This suggests that the recipe for producing strong models is as follows:
- Create synthetic data from R1 on various domains of interest
- Run SFT with a smaller model
- Apply GRPO to squeeze out additional performance
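A rough sketch of what steps 1 and 2 could look like in code is below. It is purely illustrative: the model/dataset names, prompts and hyperparameters are placeholders rather than the exact open-r1 recipe, and step 3 is the GRPO setup shown at the top of the thread.

```python
# Purely illustrative sketch of steps 1-2: model/dataset names, prompts and
# hyperparameters are placeholders, not the exact open-r1 recipe.
from datasets import Dataset, load_dataset
from vllm import LLM, SamplingParams
from trl import SFTConfig, SFTTrainer

# --- Step 1: generate synthetic reasoning traces from an R1-style teacher ---
prompts = load_dataset("GAIR/LIMO", split="train")["question"]  # any domain-specific prompts
teacher = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")  # stand-in for the R1 teacher
outputs = teacher.generate(prompts, SamplingParams(temperature=0.6, max_tokens=8192))
traces = Dataset.from_dict({
    "prompt": prompts,
    "completion": [o.outputs[0].text for o in outputs],
})

# --- Step 2: SFT a small student on the distilled traces ---
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=SFTConfig(output_dir="qwen-1.5b-distilled", num_train_epochs=1, bf16=True),
    train_dataset=traces,
)
trainer.train()

# --- Step 3: apply GRPO on top (see the GRPOTrainer sketch earlier in the thread) ---
```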
@lewtun Have you tried running something like this for AIME24?
```shell
lighteval vllm \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,dtype=bfloat16,max_model_length=32768" \
    "custom|aime24|0|0" \
    --custom-tasks open-r1/src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>"
```
This was the result I got, which helps me believe the 55.5 reported in `DeepSeek-R1-Distill-Qwen-7B`'s model card; possibly they used a better system prompt?
| Task | Version | Metric | Value | | Stderr |
|---------------|------:|----------------|----:|---|-----:|
| all | | extractive_match | 0.5 | ± | 0.0928 |
| custom:aime24:0 | 1 | extractive_match | 0.5 | ± | 0.0928 |
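As a quick sanity check on the stderr column: with the 30 AIME24 problems and 0.5 accuracy, the usual binomial standard error with n−1 in the denominator (which I assume is what lighteval reports) gives exactly this value:

```python
import math

# Binomial standard error for 15/30 correct on AIME24, with n - 1 in the denominator.
p, n = 0.5, 30
print(round(math.sqrt(p * (1 - p) / (n - 1)), 4))  # 0.0928, matching the table above
```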
What does LIMO refer to, the GAIR/LIMO dataset? I got confused! Where can I find a description of the RL training data? Thanks!
Hi @lewtun, I tried open-r1 on Qwen-2.5B Math and could also see some improvement. I'd like to share some of these results:
- I used slightly modified versions of open-r1 and trl. I detailed what was modified, but it was not much: I just adjusted a bit of the prompt and the format function. For trl, I put in the PPO-style clamping and made small modifications to use FSDP.
- I used 14 generations with grad accum = 2 and 7 devices for tuning, 1 for vLLM.
The plot below shows the evolution of the scores compared to the baseline (we take an average of 3 for all):
- We can see that there was quite a quick improvement in scores, starting from 100 steps.
- For MATH-500, it further improved as training went on.
Looking at the training curves:
- You can see that the accuracy reward jumped up quite fast, which may explain how we got good performance at 100 steps.
- The accuracy continued to rise, and at some point the MATH-500 score improved further.
Some other notes:
- I tried to use base Qwen 2.5, and the results were not good. It seems that some extended pretraining on the domain is required.
- I did not spend a lot of time tuning the hyperparameters. I did notice an improvement in the quality of responses, e.g. the model tries to verify the answer, but it does not produce very long reasoning traces like those of R1.
cc: @rganti
@mirinflim Is your open-r1 branch from before the open-r1 team found and adjusted the chat template?
@mirinflim Yes, that one, though I don't know what difference training with FSDP vs ZeRO-3 would make. Is it better to use FSDP on a single node? Also, as for native bf16, is the current one used by open_r1 different? Have you had any luck training in FP8?