About:

This GRPO-trained model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B on the DigitalLearningGmbH/MATH-lighteval dataset.

GRPO is applied after the distilled R1 model is created to further refine its reasoning capabilities. Unlike the initial distillation step, which transfers capabilities from a larger model, GRPO uses reinforcement learning to optimize the policy model by maximizing a reward signal. This fine-tuning step is distinct from distillation and aims to boost performance on chain-of-thought and reasoning tasks.
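For intuition, the "group" in GRPO refers to scoring each sampled completion against the other completions drawn for the same prompt. Below is a minimal sketch of that group-relative advantage, assuming one scalar reward per completion; the NumPy implementation and names are illustrative only, not the actual training code behind this model.

import numpy as np

def group_relative_advantages(rewards):
    # GRPO normalizes each completion's reward against the group's own
    # mean and standard deviation, so no separate critic model is needed.
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g. rewards for 4 sampled answers to one math problem
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))

Completions with above-average rewards get positive advantages and are reinforced; below-average ones are pushed down.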

Special thanks to Dongwei for fine-tuning this version of DeepSeek-R1-Distill-Qwen-7B. More information about it can be found here: https://huggingface.co./Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math

I simply converted it to MLX format with 4-bit quantization for better performance on Apple Silicon Macs (M1, M2, M3, and M4 chips).

Notes:

  • Tends to skim over the "thinking" process and start answering immediately, producing extremely quick responses that are still correct.

Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-4bit-mlx

The Model Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-4bit-mlx was converted to MLX format from Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math using mlx-lm version 0.20.5.
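For reference, such a conversion can be reproduced with mlx-lm's own converter. A minimal sketch, assuming the convert Python API (argument names and defaults may vary between mlx-lm versions, and the output path here is just an example):

from mlx_lm import convert

convert(
    "Dongwei/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math",  # source Hugging Face repo
    mlx_path="DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-4bit-mlx",  # local output dir
    quantize=True,  # quantize the weights
    q_bits=4,       # to 4 bits
)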

Use with mlx

Install the package:

pip install mlx-lm

Then load the model and run generation from Python:

from mlx_lm import load, generate

# Download (if needed) and load the 4-bit MLX weights
model, tokenizer = load("Alejandroolmedo/DeepSeek-R1-Distill-Qwen-7B-GRPO_Math-4bit-mlx")

prompt = "hello"

# Apply the model's chat template when the tokenizer provides one
if hasattr(tokenizer, "apply_chat_template") and tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
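Since this fine-tune targets math reasoning, a more representative prompt might look like the sketch below. The question and the max_tokens value are illustrative, assuming generate forwards max_tokens to the underlying sampler as in recent mlx-lm releases.

prompt = "What is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)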
