Jaward posted an update 4 days ago
Finally, here it is: a faster, custom, scalable GRPO trainer for smaller models (< 500M params). It can train on an 8GB-RAM CPU, and also supports GPU for sanity's sake (including support for vLLM + Flash Attention). Uses SmolLM2-135M/360M-Instruct as the reference and base models. Experience your own “aha” moment 🐳 on 8GB of RAM.
Code: https://github.com/Jaykef/ai-algorithms/blob/main/smollm2_360M_135M_grpo_gsm8k.ipynb

Great stuff, but the Unsloth versions are easier and better, my friend. There is no need to reinvent the wheel. Perhaps it could be an addition to the existing Unsloth, such as new rewards for different aspects of the input! You did add a feature for some specific terms, which I think was interesting, but separate that reward out. I can always ask DeepSeek to simplify your code or convert it to Unsloth... but you can do this, bro!


Bro, if you had read the repo you would see that this implementation is for educational purposes; it's not done because it's easy. Not to mention Unsloth uses TRL's GRPO trainer, which is super slow on CPU and does not scale for models under 500M params; I tried it on both CPU and GPU. This custom implementation cuts out most of the heavy lifting, letting you train and scale faster even on CPU, plus a bunch of custom configs and a simplified GRPO trainer in under 500 lines of code. There's a lot one can learn from it.
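
For readers who want to see what GRPO actually computes, here is a minimal sketch of the core step: sample a group of completions per prompt, normalize each reward against its own group's mean and std (no value network needed), then apply a clipped policy-gradient loss with a KL penalty against the frozen reference model. The function names and hyperparameter values below are illustrative assumptions, not the notebook's actual API.

import torch

def grpo_advantages(rewards):
    # rewards: (num_prompts, group_size), one scalar reward per sampled completion.
    # Group-relative trick: baseline each reward against its own group,
    # which is what lets GRPO skip training a separate value model.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_coef=0.04):
    # logp_*: summed token log-probs per completion, shape (num_prompts, group_size).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages)
    # k3 KL estimator penalizes drift away from the frozen reference model.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return (policy_loss + kl_coef * kl).mean()

# Toy check: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))

Because the advantage is just a per-group normalization of rewards, the whole update reduces to a few tensor ops, which is why a sub-500-line trainer that runs on an 8GB-RAM CPU is plausible for 135M/360M models.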