HIGGS
HIGGS is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-optimal quantization grids to achieve lower quantization error and state-of-the-art performance. You can find more information in the paper: arxiv.org/abs/2411.17525.
Runtime support for HIGGS is implemented through the FLUTE library.
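At a high level, each group of weights is rotated with a random Hadamard transform, which makes its distribution approximately Gaussian, and is then rounded to the nearest point of an MSE-optimal grid. The following is a rough conceptual sketch of that idea, not the FLUTE kernels or the actual HIGGS implementation; in particular it uses a simple scalar per-coordinate grid, whereas HIGGS uses multi-dimensional MSE-optimal grids:

import torch

def hadamard(n: int) -> torch.Tensor:
    # Sylvester construction; n must be a power of two.
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5  # orthonormal rotation

def quantize_group(w: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    # w: (..., d) weight groups; grid: (k,) codebook values (illustrative, not the HIGGS grid).
    d = w.shape[-1]
    signs = (torch.randint(0, 2, (d,)) * 2 - 1).to(w.dtype)    # random sign flips
    rotated = (w * signs) @ hadamard(d).to(w.dtype)            # Hadamard preprocessing
    idx = (rotated.unsqueeze(-1) - grid).abs().argmin(dim=-1)  # nearest grid point
    return grid[idx]

# Example: quantize 256 groups of 64 weights onto a 16-point grid.
dequantized = quantize_group(torch.randn(256, 64), torch.linspace(-2, 2, 16))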
Quantization Example
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

# Quantize the model to 4-bit HIGGS on load
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

tokenizer.decode(model.generate(
    **tokenizer("Hi,", return_tensors="pt").to(model.device),
    temperature=0.5,
    top_p=0.80,
)[0])
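Note that temperature and top_p only take effect when sampling is enabled, so a more explicit generation call might look like the following; the sampling parameters and token limit here are illustrative choices, not recommendations:

prompt = tokenizer("Hi,", return_tensors="pt").to(model.device)
output = model.generate(
    **prompt,
    do_sample=True,      # required for temperature/top_p to have an effect
    temperature=0.5,
    top_p=0.80,
    max_new_tokens=64,   # illustrative cap on generation length
)
print(tokenizer.decode(output[0]))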
Pre-quantized models
Some pre-quantized models can be found in the official collection on the Hugging Face Hub.
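Loading a pre-quantized checkpoint uses the regular from_pretrained API with no extra configuration. The repository id below is a placeholder; substitute a real HIGGS checkpoint from the collection:

from transformers import AutoModelForCausalLM

# Placeholder repository id; replace with an actual HIGGS checkpoint from the collection.
model = AutoModelForCausalLM.from_pretrained(
    "example-org/Llama-3.1-8B-Instruct-HIGGS-4bit",
    device_map="auto",
)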
Current Limitations
Architectures
Currently, FLUTE, and HIGGS by extension, only supports Llama 3.1 and 3.0 at 8B, 70B and 405B parameters, as well as Gemma-2 at 9B and 27B parameters. We’re working on supporting more diverse models, as well as arbitrary models, by modifying the FLUTE compilation procedure.
torch.compile
HIGGS is fully compatible with torch.compile. Compiling model.forward, as described here, yields the following speedups on an RTX 4090 for Llama-3.1-8B-Instruct (forward passes/sec):
| Batch Size | BF16 (with torch.compile) | HIGGS 4bit (no torch.compile) | HIGGS 4bit (with torch.compile) |
|---|---|---|---|
| 1 | 59 | 41 | 124 |
| 4 | 57 | 42 | 123 |
| 16 | 56 | 41 | 120 |
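A minimal way to apply the compilation is shown below; the mode argument is one reasonable choice, not a requirement:

import torch

# Compile only the forward pass; later calls reuse the compiled graph.
model.forward = torch.compile(model.forward, mode="max-autotune")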
Quantized training
Currently, HIGGS doesn’t support quantized training (and backward passes in general). We’re working on adding support for it.