Wowowowow

#1
by owao - opened

What's happening here @bartowski ?
Did they release a new version with further RL without even teasing anything??
I saw that in addition to re-uploading the safetensors, they also added an incredible comparative chart.
I've always had QwQ in my heart :D QwQ, Marco-o1... Alibaba were the pioneers!
So: is this a massive comeback!!?
I didn't find any new info in their blog post.

OH ok!!! I just saw it's the final model! I almost forgot the first one was a preview!
I'm so looking forward to trying this! They rock!

owao changed discussion status to closed

and thanks for the quants as always <3

Haha yes, this is a very exciting release! They've been teasing it for about a week now, and their preview back in January was quite promising. Super awesome from them!

Already testing it and it's great. Just with some prompts it's thinking for ages... and because it's so slow on my hardware, it lasts forever. One of the prompts that makes it think forever :) >>>

Write a Python function that prints the next 20 leap years. Reply with only the function.
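
For reference, a minimal sketch of the kind of function the prompt is asking for (my own illustration, not the model's actual output):

```python
from datetime import date
import calendar

def next_20_leap_years():
    """Print the next 20 leap years after the current year."""
    year = date.today().year
    found = 0
    while found < 20:
        year += 1
        if calendar.isleap(year):
            print(year)
            found += 1
```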

ah! I feel you! When you start using thinking models, you end up spending far more time refining your prompts :D And I think that's a healthier approach than throwing a dirty prompt at it and hitting generate 5 times in a row. Thinking models make you more resource-conscious!
That said, I wouldn't have expected this task to need a lot of thinking, but I guess more thinking is never too much :D

I missed the teaser! Ok, I'm going to DL some weights to try this out! I'll try your prompt @urtuuuu ;)

I have 30 usual test questions, and I'm impressed so far. First time EVER I got the correct answer to this question:
Tell me the name of a country whose name ends with 'lia'. Give me the capital city of that country as well.
Answer: Australia

I tried it out on coding and summarizing yesterday, and it really felt like it was working better than any other 32B (and many proprietary cloud ones!) I've tried before.

I didn't try it with this model yet, but I once found someone on Reddit suggesting this one: Reverse this string: '.DefaultCellStyle'
The thing here is that if you use https://tiktokenizer.vercel.app/ to see what the tokenization looks like, you will see why it's hardcore for it ;)
And when you try putting yourself in its shoes, the further you progress through the steps it has to perform, the less you can understand the powerful magic behind it!
For us, it's a simple rearrangement of individual chars, but for them it requires so much abstraction that it's crazy to imagine, when you consider that the only thing they see is a few token IDs (~4-5 here). I can't even put words on it, because it's combining so many concepts at the same time!
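
If you want to poke at it locally instead of the website, here is a quick sketch using the tiktoken package (the cl100k_base encoding is just an example; QwQ has its own tokenizer, so the exact split will differ, but the idea is the same):

```python
# Illustrative only: see how a tokenizer chops up the string.
# Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(".DefaultCellStyle")
pieces = [enc.decode_single_token_bytes(i) for i in ids]

print(ids)     # the handful of token IDs the model receives
print(pieces)  # the byte chunks those IDs stand for
```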

@MrDevolver I didn't want to clutter their discussion thread (TinyR1), so I put it directly here; here was my message:
In the meantime, you can try the final QwQ which came out yesterday. Almost R1 671B's performance, at 32B.

PS: I also found some not-so-great things about qihoo360 (in other domains, not AI) without searching for long...
So I'm glad to point you here, welcome to the adventure! You'll see this new toy is awesome!!

@bartowski
https://huggingface.co./Qwen/QwQ-32B-GGUF - 6h ago

Do you know if your quants and theirs are equivalent, or did they use different settings?

what are you referring to sorry?

I was wondering if you were aware of a special quantizing process they might have used. For example, as Unsloth did with R1 (selectively choosing the quantization level depending on the layer) or NexaQuant with R1-Distill-Llama-8B (which may be the same technique, but they didn't say). Just curiosity, in case you had some insights ;)
I should try to learn how to visualize a quantized model's architecture.
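
In the meantime, here is a rough sketch of how I imagine one could peek at the per-tensor quantization types of a GGUF file, assuming the gguf Python package that ships with llama.cpp (untested, and the file name is just a placeholder):

```python
# List each tensor's name, quant type and shape in a GGUF file.
# Requires `pip install gguf`.
from gguf import GGUFReader

reader = GGUFReader("QwQ-32B-Q4_K_L.gguf")
for tensor in reader.tensors:
    print(tensor.name, tensor.tensor_type.name, list(tensor.shape))
```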

Oh gotcha, to my knowledge they are identical, yes, besides mine using imatrix :)

I'm not sure what NexaQuant is doing; it looks like just a regular Q4_0 quant..? Unless they edited the code itself to use a different rounding mechanism, it's not special

They might use a special process, because when using the nexa convert command, e.g. nexa convert Qwen/QwQ-32B, the file size of the resulting quant differs significantly from the ones produced directly with llama.cpp.

Look here when I used nexa to convert Mistral-Small-24B-Instruct-2501-reasoning:

Nexa

❯❯❯ ollama list | grep nex
Mistral-Small-24B-Instruct-2501-reasoning_Q5_K_M_32kmax_nexa_match_0.7:latest    9a796c29bbba    18 GB     13 days ago

llama.cpp only (from brittlewis12/Mistral-Small-24B-Instruct-2501-reasoning-GGUF, sorry you weren't there :D)

SHA256: 3bf8328de1b6d487154557356de9ad694e4aa7fbe6daa621224b876fbf3919bd
Pointer size: 136 Bytes
Size of remote file: 16.8 GB

That's a mystery for now.

Also, from what I understood, the "NexaQuant" models they publish are not what you get by using nexa convert on the same model (I didn't try to verify, though). I figure otherwise they would publish more models. So they must have some manual process for those "special magic NexaQuant" models. But the "what?" still remains! Maybe they'll share more info in the future :)

Hmm... That's a proprietary solution.

Model Compression
Pack a more powerful model in your device with model compression
Use our proprietary method to shrink models via quantization, pruning, and distillation, without sacrificing accuracy. You'll save 4X the storage and memory while speeding up inference. Start with our pre-optimized models or compress your own models with your dataset for your specific use case.

And they can apply it on a broad range of transformers models, not only LMs: https://nexa.ai/blogs/nexaquant

Hmm, the weirder thing is that at 8B, their Q4_0 you linked is bigger than mine... I'll try reading into what they're doing out of curiosity

I'm very curious about how I can reproduce the Q4_K_L imatrix quantization you provide. Could you offer a tutorial?
It does provide a noticeable boost in intelligence. I'm wondering if Q8_0 embed and output weights can also be applied to a normal IQ4_XS quantization.
Additionally, could you provide a quantization for prithivMLmods/Sombrero-QwQ-32B-Elite11? Thank you!

Or perhaps an improved IQ4_NL quantization for better GPU inference performance.

Really simple and straightforward, and it doesn't depend on your available RAM: you can just use https://huggingface.co./spaces/ggml-org/gguf-my-repo to compute it with the free resources HF offers us <3. But if you prefer doing it locally, you can simply click the "..." at the top right of the page and use the "Run locally" or "Clone repo" methods ;)

Yeah @bartowski , it seems they are all bigger than when using llama.cpp. I don't know what they are doing either; from nexa/gguf/converter/nexa_convert.py they just seem to call llama.cpp, so I don't get how the same kwargs lead to a different model size. I must be missing something.

From what I've read, their NexaQuants are completely different, not the standard ones. Honestly, I tried their 8B DeepSeek R1 distilled model and they didn't lie: its quality IS more on par with the full unquantized version, while keeping it compatible with llama.cpp inference (LM Studio and such) and keeping the size down like a regular Q4. All in all, it sounds like a dream come true, but they haven't shared their secret and don't seem to respond to messages... 😧

Yeah @MrDevolver , I had the same impression when I tried the model you mentioned ;) That said, here we are talking about the GGUFs you can produce from nexa convert using their SDK, not those magic "NexaQuant"s they produce with their proprietary process.

However, it lacks customizable options for the quantization precision of different model parts, unlike the Q4_K_L provided by bartowski, which uses Q8_0 for the embed and output weights.
Also, imatrix there is only available for models under 12B, and it still requires local quantization.

Oh, I didn't know imatrix was only for small models there. So https://github.com/ggml-org/llama.cpp/blob/master/convert_hf_to_gguf.py seems like a better option; the command should be quite straightforward, but the --help is really sparse, so I never retained the syntax / the correct short names for the quant types :/

Oh, and even if it didn't help, you are always welcome!

Ok, I actually need to do it today!
So, to relieve the pain of the llama.cpp "documentation", here are the only 2 commands you need: https://qwen.readthedocs.io/en/stable/quantization/llama.cpp.html
You need to clone the repo to get the convert_hf_to_gguf.py script, then build the repo or download a release to get the llama-quantize binary.
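
Roughly, the two steps look like this (file names are placeholders; see the linked docs for the exact flags):

python convert_hf_to_gguf.py ./QwQ-32B --outtype f16 --outfile QwQ-32B-F16.gguf
./llama-quantize QwQ-32B-F16.gguf QwQ-32B-Q4_K_M.gguf Q4_K_M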

llama-quantize (from its usage output):

Allowed quantization types:
   2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
   3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
   8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
   9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  36  or  TQ1_0   :  1.69 bpw ternarization
  37  or  TQ2_0   :  2.06 bpw ternarization
  10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
  21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
  12  or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
  13  or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
  15  or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
  17  or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
  18  or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
   7  or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
   1  or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing

the Q4_K_L specifically is made like this:

./llama-quantize --imatrix model.imatrix --output-tensor-type q8_0 --token-embedding-type q8_0 ./Model-Conversion-F32.gguf ./Model-Quant-Q4_K_M.gguf Q4_K_M
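
And for the earlier question about IQ4_XS: those embed/output overrides are generic flags, so presumably the same recipe applies there too (untested sketch, same placeholder file names):

./llama-quantize --imatrix model.imatrix --output-tensor-type q8_0 --token-embedding-type q8_0 ./Model-Conversion-F32.gguf ./Model-Quant-IQ4_XS.gguf IQ4_XS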
