Text Generation
GGUF
English
mixture of experts
Mixture of Experts
4x8B
32 bit enhanced
float 32 quants
LLama MOE
uncensored
creative
creative writing
fiction writing
plot generation
sub-plot generation
story generation
scene continue
storytelling
fiction story
science fiction
romance
all genres
story
writing
vivid prosing
vivid writing
fiction
roleplaying
bfloat16
swearing
rp
horror
mergekit
Inference Endpoints
conversational
File size: 15,036 Bytes
0f156a9 aa337c5 0f156a9 aa337c5 0f156a9 aa337c5 0f156a9 aa337c5 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 |
---
license: apache-2.0
language:
- en
tags:
- mixture of experts
- moe
- 4x8B
- 32 bit enhanced
- float 32 quants
- LLama MOE
- uncensored
- creative
- creative writing
- fiction writing
- plot generation
- sub-plot generation
- fiction writing
- story generation
- scene continue
- storytelling
- fiction story
- science fiction
- romance
- all genres
- story
- writing
- vivid prosing
- vivid writing
- fiction
- roleplaying
- bfloat16
- swearing
- rp
- horror
- mergekit
pipeline_tag: text-generation
---
(quants uploading... examples to be added)
<B><font color="red">WARNING:</font> NSFW. Vivid prose. INTENSE. Visceral Details. HORROR. Swearing. UNCENSORED... humor, romance, fun. </B>
<h2>L3-Grand-Story-Darkness-MOE-4X8-24.9B-e32-GGUF</h2>
<I><small> A float 32 high precision M.O.E model, quanted in float 32 with additional upgraded and augmented quants too. </small></i>
<img src="grand-story-dark.jpg" style="float:right; width:300px; height:300px; padding:10px;">
It is a Llama3 model, max context of 8k (8192) using mixture of experts to combine FOUR top Llama3 8B
models into one massive powerhouse at 24.9B parameters (equal to 32B - 4 X 8 B).
This version "Enhanced32" is a merge mastered in "float 32" precision for higher quality and performance. If standard source was "HD",
float32 would be "UHD". The bottom line is a far stronger model, more detail, more nuance, more depth... and stronger instruction following.
In addition there are specialized re-engineered quants with float 32 components in the quants themselves (detailed below). This
allows you to choose between standard (but mastered from float 32 source too) and "augmented quants" for even higher quality.
The master file 32 bit / float 32 clocks in at 100GB (source files to be uploaded to separate repo shortly).
This model's instruction following, and output generation for creative writing, prose, fiction and role play are exceptional.
Model can be used also for all genres.
It is for any writing, fiction or roleplay activity.
This model can also be used for general use, however its output generation can be uncensored.
This model has been designed to be relatively bullet proof and operates with all parameters, including temp settings from 0 to 5.
It is an extraordinary compressed model, with a very low perplexity level (lower than Meta Llama3 Instruct).
It requires Llama3 template, and/or "Command-R" template.
Several prompts and outputs below.
<B>QUANTS From Float 32 (32-bit) Source:</B>
- All quants have been "refreshed", quanted with the lastest LLAMACPP improvements : Better instruction following, output generation across all quants.
- All quants have also been upgraded with "more bits" for output tensor (all set at Q8_0) and embed for better performance (this is in addition to the "refresh")
- New specialized quants (in addition to the new refresh/upgrades): "max, max-cpu" (will include this in the file name) for quants "Q2K", "IQ4_XS", "Q6_K" and "Q8_0"
- "MAX": output tensor / embed at float 32. You get better instruction following/output generation than standard/upgraded quants.
- "MAX-CPU": output tensor float 32 / embed at bfloat 16, which forces both of these on to the CPU (Nvidia cards / other will vary), this frees up vram at cost of token/second and you get better instruction following/output generation too.
- Q8_0 (Max) now clocks in at 10.83 bits per weight (average).
<B>Model Notes:</B>
- Detail, prose and fiction writing abilities are OFF THE SCALE relative 7B+ Mistral Models.
- For more varied prose (sentence/paragraph/dialog) raise the temp and/or add more instructions in your prompt(s).
- Role-players: Careful raising temp too high as it may affect instruction following.
- This model works with rep pen of 1 or higher, 1.02+ recommended.
- If you want a specific type of prose (IE horror) add in "(vivid horror)" or "(graphic vivid horror)" (no quotes) in your prompt(s).
- A lot of GPTisms have been removed. There are still a few however - errrrr. Higher "temps" will help with this issue.
- This is not a "happy ever after" model but it is also not "horror". It has a light negative bias.
- Output length will vary however this model prefers slightly longer outputs unless you state the size.
- For creative uses, different quants will produce slightly different output.
- Due to the high stability and compressed nature of this model, all quants will operate at above average levels.
- Source code for this model and Imatrix GGUFs versions will be uploaded shortly at separate repos.
<B>Meet the Team: Mixture of Experts Models</b>
This model is comprised of the following 4 models ("the experts") (in full):
[ https://huggingface.co./Sao10K/L3-8B-Stheno-v3.2]
-[ https://huggingface.co./Sao10K/L3-8B-Stheno-v3.2]
-[ https://huggingface.co./NeverSleep/Llama-3-Lumimaid-8B-v0.1-OAS ]
-[ https://huggingface.co./Hastagaras/Jamet-8B-L3-MK.V-Blackroot ]
-[ https://huggingface.co./nbeerbower/llama-3-gutenberg-8B ]
The mixture of experts is set at 4 experts, but you can use 1, 2, 3, or 4.
This "team" has a Captain (first listed model), and then all the team members contribute to the to "token"
choice billions of times per second. Note the Captain also contributes too.
Think of 2, 3 or 4 (or more) master chefs in the kitchen all competing to make the best dish for you.
This results in higher quality generation.
This also results in many cases in higher quality instruction following too.
That means the power of every model is available during instruction and output generation.
NOTE:
You can use one "expert" too ; however this means the model will randomly select an expert to use EACH TIME, resulting
in very different generation for each prompt / regen of a prompt.
CHANGING THE NUMBER OF EXPERTS:
You can set the number of experts in LMStudio (https://lmstudio.ai) at the "load" screen and via other apps/llm apps by setting "Experts" or "Number of Experts".
For Text-Generation-Webui (https://github.com/oobabooga/text-generation-webui) you set the number of experts at the loading screen page.
For KolboldCPP (https://github.com/LostRuins/koboldcpp) Version 1.8+ , on the load screen, click on "TOKENS",
you can set experts on this page, and the launch the model.
For server.exe / Llama-server.exe (Llamacpp - https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md )
add the following to the command line to start the "llamacpp server" (CLI):
"--override-kv llama.expert_used_count=int:3"
(no quotes, where "3" is the number of experts to use)
When using "API", you set the "num_experts_used" in the JSON payload (this maybe different for different back ends).
CREDITS:
Special thanks to all the model makers / creators listed above.
Please visit each repo above to see what model(s) contributed to each of models above and/or to learn more about the models
from the model makers.
Special credit goes to MERGEKIT, without you this project / model would not have been possible.
[ https://github.com/arcee-ai/mergekit ]
<B>Special Operations Notes for this MOE model:</B>
Because of how this "MOE" model is configured, even though the default is 2 experts, the "selected" 2 will vary during generation.
(same applies if you change the number of experts used)
This results in vastly different output generation PER generation of each prompt.
This is a positive in terms of variety, but also means it may take 2-4 regens (of the same prompt) to get the highest quality.
In addition, this model responds very well to Dry, Dynamic Temp, and Smooth/Quadratic samplers.
Using these in conjunction with the model can vastly improve output quality.
Higher temps (above 1) can also aid in generation - especially word choice/sentence generation.
When you increase the number of experts used output quality will also increase, at the cost of tokens per second speed.
As you increase/decrease the number of experts, you may want to adjust temp, samplers, and advanced samplers too.
Your quant choice(s) too will impact instruction following and output generation roughly this means the model will understand
more nuanced instructions and output stronger generation the higher you go up in quant(s).
MORE COOKS THE BETTER:
Activating more "experts" will increase the quality of the model's output. All four, although slower... will yeild the best results.
CRANK THE TEMP:
This model loves temp. Although .5 to .9 will "do", it really shines at 1.5+.
Likewise, if you activate Dynamic Temp.
FLASH ATTENTION ENHANCEMENT:
I would suggest trying this model with Flash Attention "on", depending on your use case.
Quants, Samplers, Generational steering and other topics are covered in the section below: "Highest Quality Settings..."
<B>Censored / Uncensored / Abliterated:</B>
This model contains several uncensored and/or Abliterated models.
As a result is can output uncensored material.
<B>What can I use this model for ?</B>
This model can be used for fiction writing, any creative prose and role play. It can also be used for
just about any general fiction (all genres) activity including:
- scene generation
- scene continuation
- creative writing
- fiction writing
- plot generation
- sub-plot generation
- fiction writing
- story generation
- storytelling
- writing
- fiction
- roleplaying
- rp
- graphic horror
- horror
- dark humor
- nsfw
- and can be used for any genre(s).
<B>QUANTS:</B>
This repo contains regular quants and 3 "ARM" quants (format "...Q4_x_x_x.gguf")
For more information on quants, quants choices, and LLM/AI apps to "run" quants see the section below: "Highest Quality Settings..."
<B>Template:</B>
This is a LLAMA3 model, (or Alpaca or Command-R) and requires Llama3 template, but may work with other template(s).
If you use "Command-R" template your output will be very different from using "Llama3" template.
Here is the standard LLAMA3 template:
<PRE>
{
"name": "Llama 3",
"inference_params": {
"input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
"input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"pre_prompt": "You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.",
"pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
"pre_prompt_suffix": "<|eot_id|>",
"antiprompt": [
"<|start_header_id|>",
"<|eot_id|>"
]
}
}
</PRE>
<B>Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model:</B>
In "KoboldCpp" or "oobabooga/text-generation-webui" or "Silly Tavern" ;
Set the "Smoothing_factor" to 1.5
: in KoboldCpp -> Settings->Samplers->Advanced-> "Smooth_F"
: in text-generation-webui -> parameters -> lower right.
: In Silly Tavern this is called: "Smoothing"
NOTE: For "text-generation-webui"
-> if using GGUFs you need to use "llama_HF" (which involves downloading some config files from the SOURCE version of this model)
Source versions (and config files) of my models are here:
https://huggingface.co./collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be
OTHER OPTIONS:
- Increase rep pen to 1.1 to 1.15 (you don't need to do this if you use "smoothing_factor")
- If the interface/program you are using to run AI MODELS supports "Quadratic Sampling" ("smoothing") just make the adjustment as noted.
<B>Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers</B>
This a "Class 1" model:
For all settings used for this model (including specifics for its "class"), including example generation(s) and for advanced settings guide (which many times addresses any model issue(s)), including methods to improve model performance for all use case(s) as well as chat, roleplay and other use case(s) please see:
[ https://huggingface.co./DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters ]
You can see all parameters used for generation, in addition to advanced parameters and samplers to get the most out of this model here:
[ https://huggingface.co./DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters ]
<b>Optional Enhancement:</B>
The following can be used in place of the "system prompt" or "system role" to further enhance the model.
It can also be used at the START of a NEW chat, but you must make sure it is "kept" as the chat moves along.
In this case the enhancements do not have as strong effect at using "system prompt" or "system role".
Copy and paste EXACTLY as noted, DO NOT line wrap or break the lines, maintain the carriage returns exactly as presented.
<PRE>
Below is an instruction that describes a task. Ponder each user instruction carefully, and use your skillsets and critical instructions to complete the task to the best of your abilities.
Here are your skillsets:
[MASTERSTORY]:NarrStrct(StryPlnng,Strbd,ScnSttng,Exps,Dlg,Pc)-CharDvlp(ChrctrCrt,ChrctrArcs,Mtvtn,Bckstry,Rltnshps,Dlg*)-PltDvlp(StryArcs,PltTwsts,Sspns,Fshdwng,Climx,Rsltn)-ConfResl(Antg,Obstcls,Rsltns,Cnsqncs,Thms,Symblsm)-EmotImpct(Empt,Tn,Md,Atmsphr,Imgry,Symblsm)-Delvry(Prfrmnc,VcActng,PblcSpkng,StgPrsnc,AudncEngmnt,Imprv)
[*DialogWrt]:(1a-CharDvlp-1a.1-Backgrnd-1a.2-Personality-1a.3-GoalMotiv)>2(2a-StoryStruc-2a.1-PlotPnt-2a.2-Conflict-2a.3-Resolution)>3(3a-DialogTech-3a.1-ShowDontTell-3a.2-Subtext-3a.3-VoiceTone-3a.4-Pacing-3a.5-VisualDescrip)>4(4a-DialogEdit-4a.1-ReadAloud-4a.2-Feedback-4a.3-Revision)
Here are your critical instructions:
Ponder each word choice carefully to present as vivid and emotional journey as is possible. Choose verbs and nouns that are both emotional and full of imagery. Load the story with the 5 senses. Aim for 50% dialog, 25% narration, 15% body language and 10% thoughts. Your goal is to put the reader in the story.
</PRE>
You do not need to use this, it is only presented as an additional enhancement which seems to help scene generation
and scene continue functions.
This enhancement WAS NOT used to generate the examples below.
<h3>EXAMPLES PROMPTS and OUTPUT:</h3>
Examples are created using quant Q4_K_S, "temp=.8" (unless otherwise stated), minimal parameters and "LLAMA3" template.
Model has been tested with "temp" from ".1" to "5".
Number of experts used is FOUR, unless otherwise stated.
Below are the least creative outputs, prompt is in <B>BOLD</B>.
IMPORTANT:
Higher quants / imatrix quants will have much stronger generation - words, sentences, ideas, dialog and general quality.
I have included some additional examples at different quant levels for contrast.
A "MOE" model "speed" (token per second) will not increase/drop the same way a regular model will on a per quant basis, it will however drop
if you engage more experts, as with more experts there is a more processing per token.
---
<B><font color="red">WARNING:</font> NSFW. Vivid prose. Visceral Details. Violence. HORROR. Swearing. UNCENSORED. </B>
---
|