Llama.cpp imatrix quantizations of mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-7B-v1.1

Using llama.cpp commit 3ad5451 for quantization.

All quants were made using the imatrix option and Bartowski's calibration file.
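As a quick sanity check, one of these quants can be loaded directly with llama-cpp-python. This is a minimal sketch, not part of the original quantization pipeline; the filename glob is an assumption, so substitute whichever GGUF from the table below fits your hardware:

from llama_cpp import Llama  # pip install llama-cpp-python huggingface_hub

# Download and load one quant from this repo.
# The filename pattern is an assumption: replace it with the exact GGUF you want.
llm = Llama.from_pretrained(
    repo_id="ThomasBaruzier/DeepSeek-R1-ReDistill-Qwen-7B-v1.1-GGUF",
    filename="*Q4_K_M.gguf",
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 1.5+102.2?"}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])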


Perplexity table (lower is better)

| Quant | Size (MB) | PPL | Size (%) | Accuracy (%) | PPL error rate |
|---|---|---|---|---|---|
| IQ1_S | 1815 | 29.3739 | 12.49 | 49.92 | 0.53 |
| IQ1_M | 1947 | 23.4611 | 13.40 | 62.50 | 0.42 |
| IQ2_XXS | 2167 | 23.8257 | 14.91 | 61.54 | 0.46 |
| IQ2_XS | 2354 | 20.5413 | 16.20 | 71.38 | 0.39 |
| IQ2_S | 2475 | 19.3763 | 17.03 | 75.67 | 0.36 |
| IQ2_M | 2651 | 22.3007 | 18.24 | 65.75 | 0.44 |
| Q2_K_S | 2702 | 17.5446 | 18.59 | 83.57 | 0.31 |
| Q2_K | 2876 | 16.9426 | 19.79 | 86.54 | 0.29 |
| IQ3_XXS | 2970 | 16.2668 | 20.44 | 90.14 | 0.29 |
| IQ3_XS | 3191 | 16.1443 | 21.96 | 90.82 | 0.29 |
| Q3_K_S | 3330 | 17.0364 | 22.92 | 86.07 | 0.29 |
| IQ3_S | 3337 | 16.1048 | 22.96 | 91.04 | 0.29 |
| IQ3_M | 3408 | 15.8128 | 23.45 | 92.72 | 0.28 |
| Q3_K_M | 3631 | 15.2580 | 24.99 | 96.10 | 0.26 |
| Q3_K_L | 3899 | 15.1997 | 26.83 | 96.46 | 0.26 |
| IQ4_XS | 4023 | 14.9385 | 27.68 | 98.15 | 0.25 |
| IQ4_NL | 4232 | 14.9257 | 29.12 | 98.24 | 0.25 |
| Q4_0 | 4238 | 15.2621 | 29.17 | 96.07 | 0.26 |
| Q4_K_S | 4251 | 14.8852 | 29.25 | 98.50 | 0.26 |
| Q4_K_M | 4466 | 14.8666 | 30.73 | 98.63 | 0.26 |
| Q4_1 | 4647 | 14.8789 | 31.98 | 98.54 | 0.26 |
| Q5_K_S | 5068 | 14.7449 | 34.88 | 99.44 | 0.25 |
| Q5_0 | 5081 | 14.7425 | 34.97 | 99.46 | 0.25 |
| Q5_K_M | 5192 | 14.7327 | 35.73 | 99.52 | 0.25 |
| Q5_1 | 5490 | 14.7293 | 37.78 | 99.55 | 0.25 |
| Q6_K | 5964 | 14.6907 | 41.04 | 99.81 | 0.25 |
| Q8_0 | 7723 | 14.6686 | 53.15 | 99.96 | 0.25 |
| F16 | 14531 | 14.6625 | 100 | 100 | 0.25 |
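Both percentage columns appear to be computed relative to the F16 row: Size (%) is the quant's file size over the F16 size, and Accuracy (%) is the F16 perplexity over the quant's perplexity. A quick check against the IQ4_XS row, using only numbers from the table above:

# Reference values from the F16 row
f16_ppl, f16_mb = 14.6625, 14531

# Values from the IQ4_XS row
iq4_xs_ppl, iq4_xs_mb = 14.9385, 4023

size_pct     = 100 * iq4_xs_mb / f16_mb    # ~27.7  (table: 27.68)
accuracy_pct = 100 * f16_ppl / iq4_xs_ppl  # ~98.15 (table: 98.15)
print(f"Size: {size_pct:.2f}% of F16, Accuracy: {accuracy_pct:.2f}% of F16")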

This is a version of the DeepSeek-R1-Distill-Qwen-7B model re-distilled for better performance.

Performance

| Benchmark | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-ReDistill-Qwen-7B-v1.1 |
|---|---|---|
| ARC (25-shot) | 55.03 | 52.3 |
| HellaSwag (10-shot) | 61.9 | 62.36 |
| MMLU (5-shot) | 56.75 | 59.53 |
| TruthfulQA-MC2 | 45.76 | 47.7 |
| Winogrande (5-shot) | 60.38 | 61.8 |
| GSM8K (5-shot) | 78.85 | 83.4 |
| Average | 59.78 | 61.18 |

| Benchmark | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-ReDistill-Qwen-7B-v1.1 |
|---|---|---|
| GPQA (0-shot) | 30.9 | 34.99 |
| MMLU-PRO (5-shot) | 28.83 | 31.02 |
| MUSR (0-shot) | 38.85 | 44.42 |
| BBH (3-shot) | 43.54 | 51.53 |
| IfEval (0-shot), strict | 42.33 | 35.49 |
| IfEval (0-shot), loose | 30.31 | 38.49 |

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

compute_dtype = torch.bfloat16
device   = 'cuda'
model_id = "mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-7B-v1.1"

# Load the model and tokenizer
model     = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype, attn_implementation="sdpa", device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the chat prompt, generate, and decode
prompt  = "What is 1.5+102.2?"
chat    = tokenizer.apply_chat_template([{"role":"user", "content":prompt}], tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(chat.to(device), max_new_tokens=1024, do_sample=True)
print(tokenizer.decode(outputs[0]))

Output:

<|begin▁of▁sentence|><|User|>What is 1.5+102.2?<|Assistant|><think>
First, I need to add the whole number parts of the two numbers. The whole numbers are 1 and 102, which add up to 103.

Next, I add the decimal parts of the two numbers. The decimal parts are 0.5 and 0.2, which add up to 0.7.

Finally, I combine the whole number and decimal parts to get the total sum. Adding 103 and 0.7 gives me 103.7.
</think>

To add the numbers \(1.5\) and \(102.2\), follow these steps:

1. **Add the whole number parts:**
   \[
   1 + 102 = 103
   \]

2. **Add the decimal parts:**
   \[
   0.5 + 0.2 = 0.7
   \]

3. **Combine the results:**
   \[
   103 + 0.7 = 103.7
   \]

**Final Answer:**
\[
\boxed{103.7}
\]<|end▁of▁sentence|>

HQQ

Run ~3.5x faster with HQQ. First, install the dependencies:

pip install hqq

Then quantize the model on the fly and generate:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import *

#Params
device        = 'cuda:0'
backend       = "torchao_int4" 
compute_dtype = torch.bfloat16 if backend=="torchao_int4" else torch.float16
model_id      = "mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-7B-v1.1"

#Load
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype, attn_implementation="sdpa")

#Quantize
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)

#Optimize
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend=backend, verbose=False)

############################################################
#Generate (streaming)
from hqq.utils.generation_hf import HFGenerator
gen = HFGenerator(model, tokenizer, max_new_tokens=4096, do_sample=True, compile='partial').warmup()

prompt = "If A equals B, and C equals B - A, what would be the value of C?" 
out    = gen.generate(prompt, print_tokens=True)

############################################################
# #Generate (simple)
# from hqq.utils.generation_hf import patch_model_for_compiled_runtime
# patch_model_for_compiled_runtime(model, tokenizer, warmup=True)

# prompt = "If A equals B, and C equals B - A, what would be the value of C?" 
# chat    = tokenizer.apply_chat_template([{"role":"user", "content":prompt}], tokenize=True, add_generation_prompt=True, return_tensors="pt")
# outputs = model.generate(chat.to(device), max_new_tokens=8192, do_sample=True) 
# print(tokenizer.decode(outputs[0]))