We are working on creating a single 22b from this model
Currently a friend and I are attempting to remove one 22B expert from this Mixtral model, to hopefully create its own standalone Mistral 22B-parameter model. If we succeed, would you like us to upload the weights to this HF account?
We already have code to do something similar, but we just need to adjust it slightly
https://github.com/MeNicefellow/Mixtral-Expert-Trimmer
Do you think it's possible to create a 6x22B or 4x22B model so it fits better into 2x24GB cards?
@CyberTimon Unfortunately this is not possible without severely degrading performance. The resulting model would basically be useless without fully retraining the router, and possibly the entire model. So we are hoping that by removing just one expert and using it by itself, it will work well as a standalone model without MoE.
Ah, that's unfortunate. But as far as I understand MegaBlocks / MoE, your experiment won't work either. One "expert" learns, for example, sentence positions, or has more activations when you ask history-related facts, etc. So how are you planning to extract a "working" 22b model?
"how are you planning to extract a "working" 22b model?"
With a lot of hope and prayer
Hi
A noob question: Mistral have released the 8x22B model, which is 260 GB (on torrent). So how can this be used for inference? Does it require the entire model to be loaded into memory, and therefore >260 GB of RAM? Or is this model supposed to be used to create smaller models that can then be run on normal desktops with a decent GPU/RAM?
You can use the BnB 4bit quantized version:
https://huggingface.co./mistral-community/Mixtral-8x22B-v0.1-4bit
If you manage to grab one expert, why not grab all eight? It's possible some kind of merge would make them more useful from there (or less useful!).
great idea
"Currently a friend and I are attempting to remove one 22B expert from this Mixtral model, to hopefully create its own standalone Mistral 22B-parameter model. If we succeed, would you like us to upload the weights to this HF account?"
Just FYI, the author of MergeKit did something similar with Mixtral 8x7B and each expert didn't generate comprehensible text (see DeMixtral); merging experts together didn't work either. So you might need to fine-tune quite a bit to fix it.
@mrfakename I was able to find DeMixtral, but couldn't find any reports on merging all the experts together. Can you help me find the source on the failure to merge experts? Thanks in advance.
Unfortunately also uninterpretable garbage. :( Maybe there's a merge technique that would make something work, but I haven't found one yet.
Thank you! (for future reference that was said by cg in this GH issue thread)
Looks like someone did it, but the model seems to lack knowledge
https://huggingface.co./Vezora/Mistral-22B-v0.1
The model generated incomprehensible text, so they QLoRA'd it and it became a usable model.
Hi, I am currently playing with the 1x22b version of Vezora-Mistral-22B-v0.2 and dolphin-2.9.1-mixtral-1x22b.
I ran evaluations against Vezora-Mistral-22B-v0.2 and dolphin-2.9.1-mixtral-1x22b to get PL_Alpha_Hill,
as described in AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality.
AlphaLoRA measures layer training quality based on the heavy-tailed (HT) characteristics of each layer's empirical spectral densities (ESDs), quantified by the HT metric PL_Alpha_Hill.
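For context, PL_Alpha_Hill is essentially the Hill estimator of the power-law exponent fitted to the tail of a layer's eigenvalue spectrum. Below is a minimal sketch of how it can be computed, assuming a Hugging Face checkpoint (the repo id is assumed) and averaging over each layer's linear projections; the exact conventions in AlphaLoRA/WeightWatcher (tail size, per-matrix reporting) may differ, and the SVDs are slow on a model this size:

```python
# Sketch only, not the AlphaLoRA authors' code: estimate PL_Alpha_Hill per layer by
# applying the Hill estimator to the eigenvalue spectrum (ESD) of its weight matrices.
import torch
from transformers import AutoModelForCausalLM

def hill_alpha(weight: torch.Tensor, tail_frac: float = 0.5) -> float:
    """Hill estimate of the power-law exponent of the ESD of W^T W."""
    W = weight.detach().float()
    evals = torch.linalg.svdvals(W) ** 2          # eigenvalues of W^T W
    evals, _ = torch.sort(evals)                  # ascending
    n = evals.numel()
    k = max(int(tail_frac * n), 2)                # number of upper-tail eigenvalues
    tail = evals[-k:]
    x_min = evals[-k - 1] if n > k else evals[0]
    # alpha = 1 + k / sum(log(lambda_i / x_min)) over the tail
    return (1.0 + k / torch.log(tail / x_min).sum()).item()

model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2.9.1-mixtral-1x22b",   # assumed repo id
    torch_dtype=torch.float16, device_map="cpu")

for i, layer in enumerate(model.model.layers, start=1):
    # Average the metric over the layer's linear projections (one possible convention).
    alphas = [hill_alpha(m.weight) for m in layer.modules()
              if isinstance(m, torch.nn.Linear)]
    print(f"PL_Alpha_Hill for layer {i}: {sum(alphas) / len(alphas):.4f}")
```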
I am actually curious: is 8x22B massively undertrained? Does that mean it has a lot of unused potential? I was thinking of bringing the 1x22b back to MoE, but with QLoRA, adaptive rank per layer, and uneven expert spreading, using Parameter-Efficient-MoE and AlphaLoRA.
A 22B q4_k_m + LoRA-experts MoE could become the holy grail for a 3090 or an APU? A mini DeepSeek V3?
In the specific case of 8x22b, it would be possible to have a minimum of 8 experts per layer, corresponding to the actual Mixtral extraction, with some layers having a lower LoRA rank like r64 and some with much larger LoRA ranks. Also, layers with PL_Alpha_Hill larger than 3.0 could have double the experts, using some way to trick the router into choosing one of the clones during retraining. Since the clone LoRAs would start identical to their originals, it would still work, theoretically, if the router asks the clone instead of the original. On layers with low PL_Alpha_Hill, we could collapse 2 LoRAs into 1, e.g. 8 -> 4 experts, by merging them and keeping virtual experts.
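A rough sketch of the "clone the experts and widen the router" part, assuming the Hugging Face Mixtral module layout (`block_sparse_moe.experts` plus a linear `gate`); attribute names can vary across transformers versions, and a real implementation would share the base expert weights and only add LoRAs rather than deep-copying everything:

```python
# Hypothetical sketch: double the experts of every Mixtral MoE block by cloning them
# and tiling the router weights, so the gate can pick either an original or its clone.
# Not tested; in practice you would avoid deepcopy and attach LoRAs to shared weights.
import copy
import torch
from transformers import MixtralForCausalLM

model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x22B-v0.1")  # illustrative

for layer in model.model.layers:
    moe = layer.block_sparse_moe
    n = len(moe.experts)
    # A clone initially behaves exactly like its original, so routing to it is a no-op
    # until its LoRA diverges during retraining.
    for i in range(n):
        moe.experts.append(copy.deepcopy(moe.experts[i]))
    with torch.no_grad():
        # Tile the gate so the logit for clone i equals the logit for original i.
        moe.gate.weight = torch.nn.Parameter(moe.gate.weight.repeat(2, 1))
    moe.gate.out_features = 2 * n
    moe.num_experts = 2 * n            # attribute name assumed from the HF implementation

model.config.num_local_experts *= 2
```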
Or would this basically require a rerun of the Hermes or Dolphin training?
These are the results (note I had to hack at the code a bit to run it on a 3090, so check my merge request).
Thanks to all the great contributors!
Paging: @ehartford @Vezora
PL_Alpha_Hill
dolphin-2.9.1-mixtral-1x22b
PL_Alpha_Hill for layer 1: 1.7081
PL_Alpha_Hill for layer 2: 2.4446
PL_Alpha_Hill for layer 3: 2.5381
PL_Alpha_Hill for layer 4: 2.4501
PL_Alpha_Hill for layer 5: 3.0330
PL_Alpha_Hill for layer 6: 2.7296
PL_Alpha_Hill for layer 7: 3.6432
PL_Alpha_Hill for layer 8: 2.7240
PL_Alpha_Hill for layer 9: 2.7166
PL_Alpha_Hill for layer 10: 3.2729
PL_Alpha_Hill for layer 11: 2.3617
PL_Alpha_Hill for layer 12: 2.9306
PL_Alpha_Hill for layer 13: 2.5345
PL_Alpha_Hill for layer 14: 2.2117
PL_Alpha_Hill for layer 15: 2.7240
PL_Alpha_Hill for layer 16: 3.5451
PL_Alpha_Hill for layer 17: 2.4778
PL_Alpha_Hill for layer 18: 2.8558
PL_Alpha_Hill for layer 19: 2.3630
PL_Alpha_Hill for layer 20: 2.3130
PL_Alpha_Hill for layer 21: 3.5228
PL_Alpha_Hill for layer 22: 3.3735
PL_Alpha_Hill for layer 23: 3.4345
PL_Alpha_Hill for layer 24: 4.1321
PL_Alpha_Hill for layer 25: 2.6299
PL_Alpha_Hill for layer 26: 3.2742
PL_Alpha_Hill for layer 27: 1.0000
PL_Alpha_Hill for layer 28: 1.0000
PL_Alpha_Hill for layer 29: 1.0000
PL_Alpha_Hill for layer 30: 1.0000
PL_Alpha_Hill for layer 31: 1.0000
PL_Alpha_Hill for layer 32: 1.0000
PL_Alpha_Hill for layer 33: 1.0000
PL_Alpha_Hill for layer 34: 1.0000
PL_Alpha_Hill for layer 35: 1.0000
PL_Alpha_Hill for layer 36: 1.0000
PL_Alpha_Hill for layer 37: 1.0000
PL_Alpha_Hill for layer 38: 1.0000
PL_Alpha_Hill for layer 39: 1.0000
PL_Alpha_Hill for layer 40: 1.0000
PL_Alpha_Hill for layer 41: 1.0000
PL_Alpha_Hill for layer 42: 1.0000
PL_Alpha_Hill for layer 43: 1.0000
PL_Alpha_Hill for layer 44: 1.0000
PL_Alpha_Hill for layer 45: 1.0000
PL_Alpha_Hill for layer 46: 1.0000
PL_Alpha_Hill for layer 47: 1.0000
PL_Alpha_Hill for layer 48: 1.0000
PL_Alpha_Hill for layer 49: 1.0000
PL_Alpha_Hill for layer 50: 1.0000
PL_Alpha_Hill for layer 51: 1.0000
PL_Alpha_Hill for layer 52: 1.0000
PL_Alpha_Hill for layer 53: 1.0000
PL_Alpha_Hill for layer 54: 1.0000
PL_Alpha_Hill for layer 55: 1.0000
PL_Alpha_Hill for layer 56: 1.0000
Vezora-Mistral-22B-v0.2
PL_Alpha_Hill for layer 1: 1.7365
PL_Alpha_Hill for layer 2: 2.3165
PL_Alpha_Hill for layer 3: 2.3017
PL_Alpha_Hill for layer 4: 2.6464
PL_Alpha_Hill for layer 5: 2.7786
PL_Alpha_Hill for layer 6: 2.7357
PL_Alpha_Hill for layer 7: 3.7902
PL_Alpha_Hill for layer 8: 2.8875
PL_Alpha_Hill for layer 9: 2.6768
PL_Alpha_Hill for layer 10: 3.2108
PL_Alpha_Hill for layer 11: 2.3427
PL_Alpha_Hill for layer 12: 3.0512
PL_Alpha_Hill for layer 13: 2.6569
PL_Alpha_Hill for layer 14: 2.2624
PL_Alpha_Hill for layer 15: 3.1096
PL_Alpha_Hill for layer 16: 2.5973
PL_Alpha_Hill for layer 17: 2.4316
PL_Alpha_Hill for layer 18: 3.0323
PL_Alpha_Hill for layer 19: 2.4080
PL_Alpha_Hill for layer 20: 2.4066
PL_Alpha_Hill for layer 21: 3.7185
PL_Alpha_Hill for layer 22: 3.4868
PL_Alpha_Hill for layer 23: 3.6200
PL_Alpha_Hill for layer 24: 4.1489
PL_Alpha_Hill for layer 25: 3.0064
PL_Alpha_Hill for layer 26: 3.4390
PL_Alpha_Hill for layer 27: 3.3214
PL_Alpha_Hill for layer 28: 1.0000
PL_Alpha_Hill for layer 29: 1.0000
PL_Alpha_Hill for layer 30: 1.0000
PL_Alpha_Hill for layer 31: 1.0000
PL_Alpha_Hill for layer 32: 1.0000
PL_Alpha_Hill for layer 33: 1.0000
PL_Alpha_Hill for layer 34: 1.0000
PL_Alpha_Hill for layer 35: 1.0000
PL_Alpha_Hill for layer 36: 1.0000
PL_Alpha_Hill for layer 37: 1.0000
PL_Alpha_Hill for layer 38: 1.0000
PL_Alpha_Hill for layer 39: 1.0000
PL_Alpha_Hill for layer 40: 1.0000
PL_Alpha_Hill for layer 41: 1.0000
PL_Alpha_Hill for layer 42: 1.0000
PL_Alpha_Hill for layer 43: 1.0000
PL_Alpha_Hill for layer 44: 1.0000
PL_Alpha_Hill for layer 45: 1.0000
PL_Alpha_Hill for layer 46: 1.0000
PL_Alpha_Hill for layer 47: 1.0000
PL_Alpha_Hill for layer 48: 1.0000
PL_Alpha_Hill for layer 49: 1.0000
PL_Alpha_Hill for layer 50: 1.0000
PL_Alpha_Hill for layer 51: 1.0000
PL_Alpha_Hill for layer 52: 1.0000
PL_Alpha_Hill for layer 53: 1.0000
PL_Alpha_Hill for layer 54: 1.0000
PL_Alpha_Hill for layer 55: 1.0000
PL_Alpha_Hill for layer 56: 1.0000
Mistral NeMo 12b instruct
PL_Alpha_Hill for layer 1: 2.8483
PL_Alpha_Hill for layer 2: 3.9687
PL_Alpha_Hill for layer 3: 3.6483
PL_Alpha_Hill for layer 4: 4.6750
PL_Alpha_Hill for layer 5: 3.3442
PL_Alpha_Hill for layer 6: 3.6857
PL_Alpha_Hill for layer 7: 3.8457
PL_Alpha_Hill for layer 8: 3.5505
PL_Alpha_Hill for layer 9: 3.4881
PL_Alpha_Hill for layer 10: 2.7972
PL_Alpha_Hill for layer 11: 4.1843
PL_Alpha_Hill for layer 12: 3.5826
PL_Alpha_Hill for layer 13: 3.2662
PL_Alpha_Hill for layer 14: 3.3232
PL_Alpha_Hill for layer 15: 3.9827
PL_Alpha_Hill for layer 16: 2.9114
PL_Alpha_Hill for layer 17: 3.0528
PL_Alpha_Hill for layer 18: 4.3605
PL_Alpha_Hill for layer 19: 3.4614
PL_Alpha_Hill for layer 20: 3.3892
PL_Alpha_Hill for layer 21: 4.2361
PL_Alpha_Hill for layer 22: 4.4134
PL_Alpha_Hill for layer 23: 4.8992
PL_Alpha_Hill for layer 24: 4.1821
PL_Alpha_Hill for layer 25: 5.0604
PL_Alpha_Hill for layer 26: 6.5571
PL_Alpha_Hill for layer 27: 4.4651
PL_Alpha_Hill for layer 28: 5.2947
PL_Alpha_Hill for layer 29: 4.7900
PL_Alpha_Hill for layer 30: 4.4452
PL_Alpha_Hill for layer 31: 4.7342
PL_Alpha_Hill for layer 32: 4.7901
PL_Alpha_Hill for layer 33: 4.4934
PL_Alpha_Hill for layer 34: 4.8650
PL_Alpha_Hill for layer 35: 3.8529
PL_Alpha_Hill for layer 36: 4.2417
PL_Alpha_Hill for layer 37: 1.0000
PL_Alpha_Hill for layer 38: 1.0000
PL_Alpha_Hill for layer 39: 1.0000
PL_Alpha_Hill for layer 40: 1.0000
It's possible
8x22b was pretty much a flop; 72b was stronger and smaller.
As to why, I could only guess.
In fact we had pretty much given up on MoE until Deepseek proved it could be good
I think MoE should be an answer to dense models; it's just that we haven't found the right configuration yet.
I don't think there is any functional example of a mixed-size, intelligently routed, contracting and expanding QLoRA MoE. I mean: run something like a classifier, e.g. ModernBERT, on the input and get languages, task descriptions, task complexity, knowledge domains, whether it contains code, etc., like NVIDIA's classifiers, for example.
Then pass that to a route planner that would tell the routers in the MoE which LoRAs are available for a given token at a given layer. The planner could disable whole layers if the task is easy and pick the best top_k = 8 experts at each layer, but leave it to the router to choose among those 8.
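A purely hypothetical sketch of that planner idea; every name here (PromptTags, plan_routes, expert_index) is made up for illustration, and the classifier itself is assumed to exist elsewhere:

```python
# Hypothetical route planner: turn classifier tags for a prompt into a per-layer mask of
# allowed LoRA experts; the in-model routers are then restricted to that mask per token.
from dataclasses import dataclass, field

import torch

@dataclass
class PromptTags:                    # assumed output of an external classifier head
    languages: list[str]
    domains: list[str]
    complexity: float                # 0.0 = trivial, 1.0 = very hard
    has_code: bool = False

def plan_routes(tags: PromptTags,
                expert_index: dict[int, dict[str, list[int]]],
                num_layers: int, experts_per_layer: int, top_k: int = 8) -> torch.Tensor:
    """Return a [num_layers, experts_per_layer] bool mask of experts the routers may use.

    expert_index maps layer -> domain -> expert ids, built offline (e.g. from activation stats).
    """
    mask = torch.zeros(num_layers, experts_per_layer, dtype=torch.bool)
    for layer in range(num_layers):
        # Easy prompts: skip (disable) a slice of the middle layers entirely.
        if tags.complexity < 0.3 and num_layers // 3 < layer < 2 * num_layers // 3:
            continue
        candidates: set[int] = set()
        domains = tags.domains + tags.languages + (["code"] if tags.has_code else [])
        for domain in domains:
            candidates.update(expert_index.get(layer, {}).get(domain, []))
        # Keep at most top_k candidates; the router still picks among them per token.
        for expert_id in sorted(candidates)[:top_k]:
            mask[layer, expert_id] = True
        if not mask[layer].any():            # fall back to generalist experts 0..top_k-1
            mask[layer, :top_k] = True
    return mask
```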
I read that it can be as fast as 2 ms to load a LoRA from CPU RAM to VRAM. Combine that with thousands of LoRAs per layer. The model could also shrink by disabling certain layers and become 1/5 of its size (like how MergeKit can drop layers, with LoRAs patching the gaps between the remaining layers). So a very "large, dense-like" model like a 72B could fit in VRAM if only ~22B of it were activated most of the time, and it would only be really slow when asked a very hard task that requires all layers to be activated. But then it could also be as fast as a 1B/3B/8B model if the task is easy, like tool calling.
An expanding or contracting model is also something not yet available. Something like: take an 8B model, expand its middle by duplicating 4-8 layers a few times, apply a LoRA on those virtual layers, and fine-tune it to be more intelligent; you get a 22B model with slightly(?) less performance than a real 22B but that fits in considerably less VRAM. We could also do the reverse and compress the middle layers by extracting a LoRA from one layer onto a later layer that stays. So: compressing a 22B -> 8B + a bunch of LoRAs and virtual layers. For now, there seems to be some problem with the KV cache. There is some talk about [FrankenModels](https://github.com/turboderp-org/exllamav2/pull/275) that could lead to model expansion/contraction.
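A rough sketch of the "expand by duplicating middle layers" surgery (depth up-scaling), assuming a Llama/Mistral-style decoder exposed as `model.model.layers`; the layer range and repeat count are arbitrary illustrative choices, and the duplicated layers would then be frozen and adapted with LoRAs:

```python
# Sketch: duplicate a block of middle decoder layers to "expand" a small dense model.
# Assumes the HF Mistral/Llama layout; not a complete recipe, just the layer surgery.
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1",
                                             torch_dtype=torch.bfloat16)

layers = model.model.layers
start, end, repeats = 12, 20, 2          # duplicate layers 12..19 twice (illustrative)

expanded = list(layers[:start])
for _ in range(repeats):
    # deepcopy so each "virtual" layer gets its own parameters (and later its own LoRA)
    expanded.extend(copy.deepcopy(layer) for layer in layers[start:end])
expanded.extend(layers[end:])

model.model.layers = torch.nn.ModuleList(expanded)
model.config.num_hidden_layers = len(expanded)
# Duplicated layers carry duplicated layer_idx values, which is exactly where the
# KV-cache trouble mentioned above comes from; reindex them to keep the cache consistent.
for i, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = i
```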
P.S. Retraining the routers in Mixtral could be done like MergeKit's MoE "random" gate mode: freeze the layers and train only the routers. The best data would be a small but very diverse, multilingual dataset whose answers were generated by the original Mixtral 8x22B. I think that would help the gates reinitialize themselves quickly before training on other datasets with the layers unfrozen.
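A minimal sketch of that router-only setup, assuming the Hugging Face Mixtral implementation where the per-layer gating network lives at `block_sparse_moe.gate`; the random reinitialization and the learning rate are assumptions, not a tested recipe:

```python
# Freeze every parameter except the per-layer MoE gates, optionally reinitializing them,
# then train the gates alone (e.g. by distilling answers from the original 8x22B).
import torch
from transformers import MixtralForCausalLM

model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x22B-v0.1",
                                           torch_dtype=torch.bfloat16)

for name, param in model.named_parameters():
    param.requires_grad = "block_sparse_moe.gate" in name      # routers only

# Optionally start from random gates, similar to MergeKit's "random" gate mode.
for layer in model.model.layers:
    torch.nn.init.normal_(layer.block_sparse_moe.gate.weight, std=0.02)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
# ...then run a short training pass on diverse multilingual prompts whose targets were
# generated by the original Mixtral 8x22B, with all other weights frozen.
```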
But then again, I might be mistaken.
Thanks for your great work, @ehartford!
By the way, with all this talk of MergeKit and MoE:
I'm working on a self-merge of DeepSeek-V3. The hope is to merge every other layer together and make a model half the size that is hopefully still pretty good.
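A sketch of what "merge every other layer" could look like; it uses a small dense Mistral checkpoint for clarity (DeepSeek-V3's MoE/MLA modules would need the same per-tensor treatment, which this does not show), and a simple linear average between adjacent layers:

```python
# Sketch: halve the depth of a dense decoder by averaging each pair of adjacent layers.
# Assumptions: HF Mistral/Llama layout, linear (0.5/0.5) merge; SLERP is another option.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1",
                                             torch_dtype=torch.bfloat16)
layers = model.model.layers

merged = []
for i in range(0, len(layers) - 1, 2):
    a, b = layers[i], layers[i + 1]
    with torch.no_grad():
        for pa, pb in zip(a.parameters(), b.parameters()):
            pa.copy_(0.5 * pa + 0.5 * pb)       # merge layer i+1 into layer i
    merged.append(a)
if len(layers) % 2:                              # odd depth: keep the last layer as-is
    merged.append(layers[-1])

model.model.layers = torch.nn.ModuleList(merged)
model.config.num_hidden_layers = len(merged)
for i, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = i                # keep KV-cache bookkeeping consistent
```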
If you do that, consider pruning different layers for different experts.
Check this post that I just made about model compression.
Also, just don't merge all layers the same way in pairs. I really think you should look at PL_Alpha_Hill and skip some layers from the merge, while merging 3 or more of some other layers. Could merging layers keep the model working by removing the routers between the merged layers?
You could identify experts with lower perplexity and merge many of them (if they are in the same layer), but keep the high-perplexity experts almost untouched, and use virtual experts to route to them.
Is it for size more than for speed? If it's for size, re-read my proposal for virtual experts: try merging experts, then add a virtual route from each missing expert that points to the merged one.
DeepSeek-V3 seems to have 256 experts and 61 layers. What if you convert all the experts to LoRAs for a start, and quantize the model at q2? That would be a first step. If it works, lower the LoRA ranks until the model starts breaking. Then look at virtual layers and virtual experts to shrink it further. Or just go with layer pruning/merging, but fine-tuning will be needed.
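A hedged sketch of the "convert an expert to a LoRA" step: take the difference between one expert's projection weight and a shared/base projection for the same layer, then keep a rank-r SVD truncation of that difference as the LoRA pair. The tensors below are random stand-ins; how the base weight is chosen (a merged expert, a shared expert, etc.) is left open:

```python
# Approximate (w_expert - w_base) with a rank-r LoRA pair via truncated SVD: B @ A ~ delta.
import torch

def delta_to_lora(w_expert: torch.Tensor, w_base: torch.Tensor, rank: int = 64):
    """Return (A, B) such that w_base + B @ A approximates w_expert at the given rank."""
    delta = (w_expert - w_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    A = torch.diag(S[:rank].sqrt()) @ Vh[:rank]        # [rank, in_features]
    B = U[:, :rank] @ torch.diag(S[:rank].sqrt())      # [out_features, rank]
    return A, B

# Toy usage with random tensors standing in for one expert's projection and a merged base:
w_base = torch.randn(1024, 512)
w_expert = w_base + 0.01 * torch.randn(1024, 512)
A, B = delta_to_lora(w_expert, w_base, rank=64)
print(torch.norm(w_expert - (w_base + B @ A)) / torch.norm(w_expert))  # relative error
```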
Good luck!