Jaward posted an update 4 days ago
Finally, here it is: a faster, custom, scalable GRPO trainer for smaller models (< 500M params). It can train on an 8GB-RAM CPU, and also supports GPU for sanity's sake (including support for vLLM + Flash Attention). Uses SmolLM2-135M/360M-Instruct as the reference and base models. Experience your own “aha” moment 🐳 on 8GB of RAM.
Code: https://github.com/Jaykef/ai-algorithms/blob/main/smollm2_360M_135M_grpo_gsm8k.ipynb

Great stuff, but the Unsloth versions are easier and better, my friend. There is no need to reinvent the wheel. Perhaps it could be an addition to the existing Unsloth, such as new rewards for different aspects of the input! You did add a feature for some specific terms, which I think was interesting, but separate that reward out. I can always ask DeepSeek to simplify your code or convert it to Unsloth... but you can do this, bro!


Bro, if you had read the repo you would see that this implementation is for educational purposes; it's not done because it's easy. Not to mention Unsloth uses TRL's GRPO trainer, which is super slow on CPU and does not scale for models under 500M params; I tried it on both CPU and GPU. This custom implementation cuts out most of the heavy lifting, letting you train and scale faster even on CPU, plus a bunch of custom configs and a simplified GRPO trainer in under 500 lines of code. There's a lot one can learn from it.
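
For readers who want to see what GRPO actually computes, here is a minimal sketch of the core step: sample a group of completions per prompt, normalize each reward against its own group's mean and std (no value network needed), then apply a clipped policy-gradient loss with a KL penalty against the frozen reference model. The function names and hyperparameter values below are illustrative assumptions, not the notebook's actual API.

import torch

def grpo_advantages(rewards):
    # rewards: (num_prompts, group_size), one scalar reward per sampled completion.
    # Group-relative trick: baseline each reward against its own group,
    # which is what lets GRPO skip training a separate value model.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_coef=0.04):
    # logp_*: summed token log-probs per completion, shape (num_prompts, group_size).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages)
    # k3 KL estimator penalizes drift away from the frozen reference model.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return (policy_loss + kl_coef * kl).mean()

# Toy check: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))

Because the advantage is just a per-group normalization of rewards, the whole update reduces to a few tensor ops, which is why a sub-500-line trainer that runs on an 8GB-RAM CPU is plausible for 135M/360M models.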