Fine-grained FP8
With fine-grained FP8 quantization, you can quantize your model to FP8 (W8A8):
- Weights are quantized to 8 bits (FP8) per 2D block (e.g. weight_block_size=(128, 128)), a scheme inspired by the DeepSeek implementation.
- Activations are quantized to 8 bits (FP8) per group per token, with the group size along the input channels matching the weight block size (128 by default).
This method was added to support the DeepSeek-V3 and DeepSeek-R1 models; see the DeepSeek-V3 paper for details, and the sketch below for an illustration of the quantization scheme.
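To make the scheme concrete, here is a minimal sketch of the idea in plain PyTorch (this is not the kernel Transformers actually uses): every 128x128 weight tile and every 128-element activation group per token gets its own FP8 scale. The helper names are illustrative, and the sketch assumes the tensor dimensions divide evenly by the block size.

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for the e4m3 format

def quantize_weight_blockwise(weight, block=128):
    # One scale per (block, block) tile; assumes shapes are divisible by `block`.
    out_f, in_f = weight.shape
    tiles = weight.reshape(out_f // block, block, in_f // block, block)
    scales = (tiles.abs().amax(dim=(1, 3), keepdim=True) / FP8_MAX).clamp(min=1e-12)
    quantized = (tiles / scales).to(torch.float8_e4m3fn)
    return quantized.reshape(out_f, in_f), scales.squeeze(1).squeeze(-1)

def quantize_activation_per_token_group(x, group=128):
    # One scale per token and per `group` input channels.
    tokens, in_f = x.shape
    groups = x.reshape(tokens, in_f // group, group)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    quantized = (groups / scales).to(torch.float8_e4m3fn)
    return quantized.reshape(tokens, in_f), scales.squeeze(-1)

w_q, w_scales = quantize_weight_blockwise(torch.randn(256, 256))
x_q, x_scales = quantize_activation_per_token_group(torch.randn(4, 256))
print(w_q.dtype, w_scales.shape, x_scales.shape)  # torch.float8_e4m3fn, (2, 2) weight scales, (4, 2) activation scales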
You need a GPU with compute capability >= 9.0 (e.g. H100).
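You can check the compute capability of your GPU with PyTorch before loading the model:

import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")  # 9.0 for H100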
Before you begin, make sure the following libraries are installed at their latest versions:
pip install --upgrade accelerate torch
Make sure to install a PyTorch build that is compatible with your GPU's CUDA version.
By default, the weights are loaded in full precision (torch.float32), regardless of the data type they are actually stored in, such as torch.float16. Set torch_dtype="auto" to load the weights in the data type defined in a model's config.json file, which automatically selects the most memory-efficient data type.
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"

# Quantize the weights to FP8 on the fly while loading the model
quantization_config = FineGrainedFP8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
A quantized model can be saved with save_pretrained and loaded again with from_pretrained:
quant_path = "/path/to/save/quantized/model"
quantized_model.save_pretrained(quant_path)

# Reload the already-quantized model; no quantization_config is needed
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
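The reloaded model can be used for generation exactly like the original quantized model:

output = model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))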