SpQR
The SpQR quantization algorithm uses a 16x16 tiled bi-level group 3-bit quantization structure with sparse outliers, as detailed in SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression.
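A quick sketch to make that structure concrete: weights are quantized to 3 bits in small groups, the per-group scales and zero points are themselves quantized in a second level (SpQR uses 16x16 tiles of these statistics), and weights that quantize poorly are kept exactly as sparse half-precision outliers. The group sizes, outlier threshold, and helper names below are illustrative assumptions, not the exact SpQR procedure.

import torch

def quantize_groups(x, bits=3):
    # Asymmetric min-max quantization, one scale/zero point per row (group).
    levels = 2**bits - 1
    xmin = x.min(dim=-1, keepdim=True).values
    xmax = x.max(dim=-1, keepdim=True).values
    scale = (xmax - xmin).clamp(min=1e-8) / levels
    q = torch.clamp(torch.round((x - xmin) / scale), 0, levels)
    return q, scale, xmin

def spqr_like_quantize(weight, group_size=16, outlier_threshold=0.1):
    # Illustrative bi-level quantization; assumes dimensions divide evenly.
    w = weight.reshape(-1, group_size)

    # First pass: find weights with a large quantization error and keep
    # them exactly in half precision as a sparse tensor of outliers.
    q, scale, zero = quantize_groups(w)
    err = (w - (q * scale + zero)).abs()
    outlier_mask = err > outlier_threshold
    outliers = torch.where(outlier_mask, w, torch.zeros_like(w)).half().to_sparse()

    # Second pass: quantize with the outliers zeroed out of the statistics.
    q, scale, zero = quantize_groups(torch.where(outlier_mask, torch.zeros_like(w), w))

    # Second level: the first-level scales and zero points are themselves
    # quantized in groups of 16, mirroring SpQR's 16x16 tiles.
    second_level_scales = quantize_groups(scale.reshape(-1, 16))
    second_level_zeros = quantize_groups(zero.reshape(-1, 16))

    return q.to(torch.uint8), second_level_scales, second_level_zeros, outliers

q, scales2, zeros2, outliers = spqr_like_quantize(torch.randn(64, 64))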
To quantize a model with SpQR, refer to the Vahe1994/SpQR repository.
Load a SpQR-quantized model with from_pretrained().
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the SpQR-quantized checkpoint in half precision and let Accelerate
# place the layers across the available devices.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf",
    torch_dtype=torch.half,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
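The loaded model can then be used like any other Transformers model; a minimal generation example (the prompt text is illustrative):

inputs = tokenizer("SpQR compresses LLM weights by", return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))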