mobicham committed on
Commit
bd7e5f5
·
1 Parent(s): 6f6f04b

Update README.md

Files changed (1)
  1. README.md +19 -0
README.md CHANGED
@@ -1,3 +1,22 @@
  ---
  license: apache-2.0
+ train: false
+ inference: false
+ pipeline_tag: text-generation
  ---
+ ## Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ
+ This is a version of the Mixtral-8x7B-v0.1 model (https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) quantized with a mix of 4-bit and 2-bit precision via Half-Quadratic Quantization (HQQ).
+
+ More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit. This simple change yields a large improvement in perplexity vs. the all 2-bit model (4.69 vs. 5.90) for a slight increase in model size (18.2 GB vs. 18 GB).
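For reference, a mixed attention/expert setup like this can be expressed through HQQ's per-layer quantization settings. The sketch below is illustrative only: it assumes the `BaseQuantizeConfig` and `quantize_model` API from the HQQ library and the standard Mixtral module names (`self_attn.*_proj`, `block_sparse_moe.experts.w1/w2/w3`), and the group sizes are assumptions, not the exact recipe used to produce this checkpoint.

```Python
# Illustrative sketch (not the exact recipe used for this model):
# quantize attention projections to 4-bit and MoE expert projections to 2-bit with HQQ.
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

# Per-layer settings; group sizes here are assumptions for illustration
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16)

quant_config = {}
# Attention projections -> 4-bit
for tag in ['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj']:
    quant_config[tag] = attn_params
# Expert projections -> 2-bit
for tag in ['block_sparse_moe.experts.w1', 'block_sparse_moe.experts.w2', 'block_sparse_moe.experts.w3']:
    quant_config[tag] = experts_params

# Load the full-precision base model and quantize it in place
model = HQQModelForCausalLM.from_pretrained('mistralai/Mixtral-8x7B-v0.1')
model.quantize_model(quant_config=quant_config)
```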
+ ### Basic Usage
+ To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
+ ```Python
+ from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
+ model_id = 'mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ'
+ # Load the tokenizer and the pre-quantized model
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = HQQModelForCausalLM.from_quantized(model_id)
+ # Optional: switch to the compiled PyTorch backend for faster inference
+ from hqq.core.quantize import HQQLinear, HQQBackend
+ HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
+ ```
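Once loaded, the model can be used for text generation through the standard `transformers` generation API. The snippet below is a minimal sketch that reuses `model` and `tokenizer` from the block above; the prompt, `max_new_tokens`, and decoding settings are arbitrary illustrative choices, not recommendations from the model authors.

```Python
# Minimal text-generation sketch; reuses `model` and `tokenizer` from the basic-usage snippet
import torch

prompt = "Explain what a mixture-of-experts model is."
# Move the inputs to the same device as the quantized model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```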