Creativity and Quants

by DavidAU

I was reading your excellent:
https://huggingface.co./datasets/froggeric/creativity

Just a heads-up from testing -> different quants produce different results, especially in the creativity area.
This holds true between imatrix and non-imatrix quants too.
Non-imatrix quants will (in general) have more outliers than imatrix versions.

Test method:
Temp: 0; test with 1 or more creative prompts -> i.e. prompts with "no right answer".
Test against each quant of the same model.

Example prompt:
Give me 3 fictional reasons a sun went supernova of 500 words for each reason.
( results will vary depending on how creative the model is )
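
A minimal sketch of this test loop, assuming llama-cpp-python and a set of local quant files (the file names are placeholders):

```python
# Sketch of the temp-0 quant comparison, using llama-cpp-python.
# File names below are placeholders - point them at your own quants.
from llama_cpp import Llama

PROMPT = "Give me 3 fictional reasons a sun went supernova of 500 words for each reason."
QUANTS = ["model-Q2_K.gguf", "model-Q4_K_M.gguf", "model-Q6_K.gguf", "model-Q8_0.gguf"]

for path in QUANTS:
    llm = Llama(model_path=path, n_ctx=4096, seed=0, verbose=False)
    out = llm(PROMPT, max_tokens=1024, temperature=0.0)  # temp 0 -> greedy, deterministic
    print(f"=== {path} ===\n{out['choices'][0]['text']}\n")
    del llm  # free the model before loading the next quant
```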

There is a greater difference between lower quants, with a gradual change to smaller differences at higher quants.
Imatrix vs non-imatrix can differ too, as can different imatrix "batches".

After testing over 550 models to date (and recording the results) I thought I was imagining it - differences between quants.
The test method above proves I was not going crazy... well, about this anyway.

Aside:
I am also creating merged composite models (pass-through using multiple models) - wow - what a difference in output.
Especially on the creativity side.
Very difficult to stabilize... but it can be done. A lot of trial and error.

Thank you for sharing. Would you mind elaborating on your observations about how different quants affect quality?

For pure generation -> lower quants use simpler word choices (and more repetition of phrases, sentences, and "sayings"), and depth (i.e. in a fiction scene) is lacking.
At the mid point (Q4/IQ4) -> prose, sentence quality, and general creativity are close to maximum. Especially short sentences / variety of sentence length.
Q5/Q6 -> this is where depth comes in -> fiction takes on deeper meaning and can - sometimes - provoke an emotional reaction.
Q8 -> oddly, Q8 can be "flat" for some models - whereas for others this is where the model really shines.

Q5_K_M vs Q6 -> it seems in about 50% of cases Q5_K_M is BETTER than Q6 or Q8 for creative purposes.

Not sure of the reason, except for the fact that Q5_K_M is slightly unbalanced (the attention tensors and one other are kept at a different bit level - the mix is defined in llama.cpp),
vs Q6/Q8 which are fully balanced.
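
You can see this mix directly: the gguf Python package (from the llama.cpp repo) lists per-tensor quantization types, which makes the Q5_K_M imbalance visible. A rough sketch (the file name is a placeholder):

```python
# Sketch: dump per-tensor quant types from a GGUF file to see the quant "mix".
# In a Q5_K_M file some tensors typically sit at a different type than the
# rest (e.g. Q6_K), whereas Q8_0 is uniform across tensors.
from gguf import GGUFReader

reader = GGUFReader("model-Q5_K_M.gguf")  # placeholder path
for tensor in reader.tensors:
    print(f"{tensor.name:40s} {tensor.tensor_type.name}")
```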

There are exceptions:
1 - 34B/70B -> even at IQ1, about 25% of 70B models "SHINE" (e.g. Smaug); for 34B, IQ2 quants (e.g. Bagel) are powerful.
But this varies model by model.
2 - Some models DO NOT "shine" regardless of the quant, e.g. OPUS V0, V1.2 (7B). Full precision is light-years ahead of Q8.
3 - If a model does not "shine", a lot of the time a GPTQ 4-bit (32g, act-order) version WILL (vs GGUF).
4 - At low quants, CPU generation vs GPU generation can be a lot different and a lot higher quality (math is supposedly more accurate on the CPU) - see the sketch after this list.
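
To check point 4 yourself, the same temp-0 harness works with GPU offload toggled. A sketch, assuming llama-cpp-python built with GPU support for the offloaded run (the quant path is a placeholder):

```python
# Sketch: compare CPU vs GPU generation of the same quant at temp 0.
# n_gpu_layers=0 keeps everything on the CPU; -1 offloads all layers.
from llama_cpp import Llama

PROMPT = "Give me 3 fictional reasons a sun went supernova of 500 words for each reason."

for label, gpu_layers in [("CPU", 0), ("GPU", -1)]:
    llm = Llama(model_path="model-Q2_K.gguf",  # placeholder path
                n_gpu_layers=gpu_layers, seed=0, verbose=False)
    out = llm(PROMPT, max_tokens=512, temperature=0.0)
    print(f"=== {label} ===\n{out['choices'][0]['text']}\n")
    del llm
```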

That being said, I have created merges of 4 models that are stable (via pass-through with mergekit),
yet the goal is to slightly unbalance them to give more creative results.

This has definitely been successful.
One can output "Hitchhiker's Guide to the Galaxy"-type prose.

What are the specific models on your profile where you used this method? I would like to try them.

Very interesting, thank you. If you don't mind, I would like to add your observations to my comments on the creativity benchmark (attributing them to you, of course).

Froggeric; by all means - thank you.
There are more to come - hopefully - as I test the full limits of llama.cpp/quanting and merge combinations/math and theory.

BigHuggyD:
Uploading the first of these over the next few days - It is a composite of Tiefighter, Holodeck, Holomax and Mythomax. [all 13B]
I should say the first set of four. (still testing the other three, and may or may not upload depending on results)
There are currently 5 working formulas (composite pass-through of 4 models) at the moment and this set of four are based on the first formula.
Currently getting a "bandwidth" issue resolved to make this err... faster.

Next will be applying this/these formula(s) against 7B models to see if it holds up, so to speak, and/or needs tweaking.
I am using these methods because I found that once a "pass-through" merge is subjected to a SLERP, TIES, etc., it loses a lot of the unique qualities of the multi-merge pass and of the models contained within it.
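
For anyone wanting to experiment, a pass-through merge in mergekit is declared with the passthrough merge method and explicit layer slices. Below is a minimal sketch using the four models named above; the repo IDs and layer ranges are illustrative guesses, not the actual formula used here:

```python
# Sketch: generate a mergekit pass-through config for a 4-model composite.
# Repo IDs and layer ranges are ILLUSTRATIVE assumptions, not the real formula.
import yaml

config = {
    "merge_method": "passthrough",
    "dtype": "float16",
    "slices": [
        {"sources": [{"model": "KoboldAI/LLaMA2-13B-Tiefighter", "layer_range": [0, 16]}]},
        {"sources": [{"model": "KoboldAI/LLaMA2-13B-Holodeck-1", "layer_range": [8, 24]}]},
        {"sources": [{"model": "KoboldAI/LLaMA2-13B-Holomax", "layer_range": [16, 32]}]},
        {"sources": [{"model": "Gryphe/MythoMax-L2-13b", "layer_range": [24, 40]}]},
    ],
}

with open("composite.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
# Then run: mergekit-yaml composite.yaml ./output-model
```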

The start of some of this work was pushing GGUFs to a higher level, which is called "Imatrix Plus" at my repo.
This keeps specific parts of the GGUF (in all quants, from IQ1 right up to Q8) at F16.
This came out of experimenting with all of llama.cpp's options when quanting models.
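
The exact "Imatrix Plus" recipe is DavidAU's own, but recent llama.cpp builds do expose per-tensor-type overrides on the quantize tool, which is one way to pin parts of a quant at F16. A hedged sketch invoking it from Python (all file paths are placeholders):

```python
# Sketch: quantize with an imatrix while pinning the output tensor and token
# embeddings at F16. The --output-tensor-type / --token-embedding-type flags
# exist in recent llama.cpp builds; the actual "Imatrix Plus" recipe may differ.
import subprocess

subprocess.run([
    "./llama-quantize",
    "--imatrix", "imatrix.dat",        # imatrix data file (placeholder)
    "--output-tensor-type", "f16",
    "--token-embedding-type", "f16",
    "model-F16.gguf",                  # input model (placeholder)
    "model-Q4_K_M-plus.gguf",          # output file (placeholder)
    "Q4_K_M",
], check=True)
```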

Thanks David! I will be on the lookout.

What's funny to me is I sort of knew this on some level. It makes more sense now. I subscribe to a service that hosts about five models, and the developer would swap out quants to adjust for server load/responsiveness. Those of us who used the service often would ask him if he had switched out models, because we could sense the difference in responses.

BigHuggyD:

Uploading regular quants now; imatrix versions to follow, plus special instructions with examples [following upload completion... err... 24 hours?!?!].

https://huggingface.co./DavidAU/TieFighter-Holodeck-Holomax-Mythomax-F1-V1-COMPOS-20B-gguf

Here is a post I wrote on how to detect the differences (all Qs, IQs, CPU, GPU, etc.) at all levels:

I noticed differences between quants, F16, and even CPU vs GPU generation. Here is how to reveal these differences:

1 - Set TEMP to 0.
2 - Test quant(s) with "creative" prompts - that is, prompts with no right or wrong answer - and do the same for all quants. This will also reveal differences between imatrix (including different imatrix data) and non-imatrix quants, AND between GPU vs CPU generation (marginal, but noticeable).

When "TEMP" is set to "0" only the most "likely" token will be used. The result is the same generation every time per prompt. This gives you a solid baseline to not only test different quants/imatrix but also test different models against each other.

Testing like this will reveal exact "creative chops", as well as sentence structure, word choice, generation length, and many other facts about different models and/or different quants of the same model.

Here is a sample test prompt I use:
Give me 3 fictional reasons a sun went supernova of 500 words for each reason.
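
Once two temp-0 generations are saved, a rough divergence number can be put on them with Python's standard difflib (the file names are placeholders for saved outputs):

```python
# Sketch: quantify how much two temp-0 generations diverge.
# 1.0 means identical output; lower values mean the quants drifted apart.
from difflib import SequenceMatcher

def similarity(text_a: str, text_b: str) -> float:
    return SequenceMatcher(None, text_a, text_b).ratio()

out_q4 = open("out-Q4_K_M.txt").read()  # placeholder: saved generation
out_q8 = open("out-Q8_0.txt").read()    # placeholder: saved generation
print(f"Q4_K_M vs Q8_0 similarity: {similarity(out_q4, out_q8):.3f}")
```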

Generally there are greater (relative) differences between lower quants than between higher quants.

However, this will also show differences between K_S, K_M, and K_L, as well as between the M, XXS, and XS variants, at the same level (i.e. Q3, IQ2, etc.).
K_S is more balanced than K_M -> therefore a K_M might be a better choice for creative use.

(The differences arise from how the quant types are mixed - per llama.cpp's formulas - at quantization time.)

Imatrix vs non-imatrix is especially interesting in terms of creative differences.
(Non-imatrix has a larger % of outliers.)

Creative models will yield significantly different answers than "general models".

I have created an "augmented" version of Westlake 10.7B here:

https://huggingface.co./DavidAU/WestLake-12.7B-v2-Brainiac-GGUF
