About:
This GRPO-trained model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B on the DigitalLearningGmbH/MATH-lighteval dataset.
GRPO is applied after the distilled R1 model is created to further refine its reasoning capabilities. Unlike the initial distillation step, which transfers capabilities from a larger model, GRPO uses reinforcement learning to optimize the policy model by maximizing a reward signal. This fine-tuning stage is distinct from distillation and aims to boost performance on chain-of-thought and reasoning tasks.
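For intuition, the core of GRPO is a group-relative advantage: several completions are sampled for the same prompt, and each completion's reward is normalized against the group's mean and standard deviation, so above-average answers are reinforced. The following is a minimal illustrative sketch of that idea, not the actual training code used for this model:

import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled completion's reward against its group's mean/std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled answers to the same math problem,
# rewarded 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # correct answers get positive advantage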
Special thanks to Dongwei for fine-tuning this version of DeepSeek-R1-Distill-Qwen-7B. More information about it can be found here: https://huggingface.co./Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math
I simply converted it to MLX format with 8-bit quantization for better performance on Apple Silicon Macs (M1, M2, M3, M4 chips).
Notes:
- Tends to skip over the explicit "thinking" process and start answering immediately, producing extremely quick but still correct answers.
Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx
The Model Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx was converted to MLX format from Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math using mlx-lm version 0.20.5.
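For reference, a conversion like this can typically be reproduced with mlx-lm's convert utility. This is a sketch only; parameter names follow recent mlx-lm releases and may differ in other versions:

from mlx_lm import convert

# Quantize the original GRPO fine-tune to 8-bit and write MLX weights locally
# (mlx_path and q_bits shown here are assumptions; check your mlx-lm version's docs)
convert(
    "Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math",
    mlx_path="DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx",
    quantize=True,
    q_bits=8,
)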
Use with mlx
pip install mlx-lm
from mlx_lm import load, generate

# Load the 8-bit quantized model and its tokenizer from the Hugging Face Hub
model, tokenizer = load("Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx")

prompt = "hello"

# Wrap the prompt in the model's chat template if the tokenizer provides one
if hasattr(tokenizer, "apply_chat_template") and tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
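Since the model was tuned on MATH problems and can still emit long reasoning traces, it helps to give generation a larger token budget. Continuing from the snippet above, a small usage variation (max_tokens is supported by generate in recent mlx-lm versions):

# Ask a MATH-style question and leave room for the full worked answer
prompt = "What is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)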
Model tree for Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-8bit-mlx
- Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B