---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
language:
- en
license: llama3.1
pipeline_tag: text-generation
tags:
- int8
- w8a8
- text-generation
---

# Meta-Llama-3-8B-Instruct-quantized.w8a8

## Model Overview

- **Model Architecture:** Meta-Llama-3
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight and Activation Quantization:** INT8 (W8A8)
- **Intended Use Cases:** Intended for commercial and research use across multiple languages, designed to function as an assistant-like chat model.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 9/2024
- **Version:** 1.0
- **License(s):** Llama3.1
- **Model Developers:** Mahesh Yaddanapudi

Quantized version of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
This model is optimized using weight and activation quantization to INT8, which reduces memory usage and enables deployment in more resource-constrained environments.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to the INT8 (W8A8) data type.
This optimization reduces the number of bits per parameter and activation from 16 to 8, which roughly halves the disk size and memory footprint of the quantized layers.

The weights and activations of the linear operators within transformer blocks are quantized using the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm, which applies symmetric per-channel quantization with a 1% damping factor. Calibration uses 256 sequences of 8,192 random tokens each.

## Deployment

This model can be deployed efficiently using various backends compatible with INT8 models, as shown in the example below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True,
)

prompt = "What are the benefits of model quantization in AI?"
# Move the tokenized inputs to the same device as the model before generating
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Cap the response length explicitly
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

## Creation

This model was created with the GPTQ implementation in the AutoGPTQ library, as shown in the code snippet below.

```python
import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Create random examples for quantization calibration
num_samples = 256
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Sample token IDs uniformly from the tokenizer vocabulary
max_token_id = len(tokenizer.get_vocab()) - 1
examples = [
    {
        "input_ids": [random.randint(0, max_token_id) for _ in range(max_seq_len)],
        "attention_mask": max_seq_len * [1],
    }
    for _ in range(num_samples)
]

# Define quantization configuration for W8A8
# group_size=-1 selects per-channel quantization; damp_percent=0.01 is the 1% damping factor
quantize_config = BaseQuantizeConfig(
    bits=8,
    group_size=-1,
    desc_act=True,
    model_file_base_name="model",
    damp_percent=0.01,
)

# Load and quantize the model
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config,
    device_map="auto",
)
model.quantize(examples)
model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a8")
```

## Future Work

Further evaluations are planned to compare this quantized model with its unquantized and higher-bit quantized counterparts, especially on benchmarks relevant to code generation and logical reasoning tasks.
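
## Appendix: Symmetric Per-Channel Quantization Sketch

The Model Optimizations section mentions symmetric per-channel INT8 quantization. The sketch below illustrates that idea in isolation, using plain round-to-nearest on a small hand-written weight matrix. It is not GPTQ (which additionally compensates for rounding error during calibration) and it is not how AutoGPTQ stores weights internally. The function names `quantize_per_channel` and `dequantize_per_channel`, the `[-127, 127]` range, and the example values are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed symmetric integer range for this illustration: [-127, 127]
INT8_MAX = 127


def quantize_per_channel(
    weights: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Quantize each row (one output channel) with its own scale.

    Hypothetical helper: round-to-nearest only, no GPTQ error compensation.
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Symmetric scheme: zero maps to zero, so only a scale is needed
        scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
        q_row = [max(-INT8_MAX, min(INT8_MAX, round(w / scale))) for w in row]
        q_rows.append(q_row)
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(
    q_rows: List[List[int]], scales: List[float]
) -> List[List[float]]:
    """Map the integers back to floats using the per-channel scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Two channels with very different magnitudes (made-up values)
weights = [
    [0.50, -1.27, 0.03, 0.90],
    [0.002, -0.004, 0.001, 0.003],
]

q, scales = quantize_per_channel(weights)
restored = dequantize_per_channel(q, scales)

for channel, (row, r_row, s) in enumerate(zip(weights, restored, scales)):
    worst = max(abs(w - r) for w, r in zip(row, r_row))
    print(f"channel {channel}: scale={s:.6g}, max abs error={worst:.3g}")
```

Because each channel gets its own scale, the second row keeps its resolution even though its values are several hundred times smaller than those in the first row. With a single scale for the whole matrix, every value in the second row would round to zero.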