---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
language:
- en
license: llama3
pipeline_tag: text-generation
tags:
- int8
- w8a8
- text-generation
---

# Meta-Llama-3-8B-Instruct-quantized.w8a8

## Model Overview
- **Model Architecture:** Meta-Llama-3
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight and Activation Quantization:** INT8 (W8A8)
- **Intended Use Cases:** Intended for commercial and research use in English, designed to function as an assistant-like chat model.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 9/2024
- **Version:** 1.0
- **License(s):** Llama3
- **Model Developers:** Mahesh Yaddanapudi

Quantized version of [Meta-Llama-3-8B-Instruct](https://huggingface.co./meta-llama/Meta-Llama-3-8B-Instruct). The weights and activations are quantized to INT8, which roughly halves the disk and memory footprint of the weights and makes the model easier to deploy in resource-constrained environments.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Meta-Llama-3-8B-Instruct](https://huggingface.co./meta-llama/Meta-Llama-3-8B-Instruct) to the INT8 (W8A8) data type. This optimization reduces the number of bits per parameter and per activation from 16 to 8, significantly reducing disk size and memory requirements.
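As a rough back-of-the-envelope estimate, for this 8-billion-parameter model that corresponds to shrinking the weight footprint from about 16 GB in FP16/BF16 to about 8 GB in INT8; activations and the KV cache add to this at inference time.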

The weights and activations of the linear operators within transformer blocks are quantized. Weight quantization uses the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm with symmetric per-channel quantization, a 1% damping factor, and 256 calibration sequences of 8,192 random tokens each.
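For intuition, the sketch below shows what symmetric per-channel INT8 weight quantization looks like in isolation. It is illustrative only: the function and tensor names are hypothetical, and it omits the Hessian-based error compensation that GPTQ applies on top of this scheme.

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric per-channel INT8 quantization of a 2-D weight matrix."""
    # One scale per output channel (row), chosen so the largest absolute
    # value in that row maps to 127; clamp avoids division by zero.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    # Round to the nearest integer and clamp to the signed 8-bit range.
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

# Quantize a random weight matrix and check the reconstruction error.
w = torch.randn(4096, 4096)
q, scale = quantize_per_channel_int8(w)
w_hat = q.float() * scale
print("max abs error:", (w - w_hat).abs().max().item())
```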

## Deployment

This model can be deployed with backends that support INT8 checkpoints; the example below loads it with the Hugging Face `transformers` library.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # spread layers across the available devices
    low_cpu_mem_usage=True,
)

prompt = "What are the benefits of model quantization in AI?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
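Since this is an instruction-tuned model, prompts are usually wrapped in the Llama 3 chat template before generation. The snippet below continues the example above using `tokenizer.apply_chat_template`; the prompt text is illustrative.

```python
# Continue from the example above: wrap the prompt in the chat template.
messages = [
    {"role": "user", "content": "What are the benefits of model quantization in AI?"},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(chat_inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```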

## Creation

This model was created with the GPTQ quantization algorithm as implemented in the AutoGPTQ library, as demonstrated in the code snippet below.

```python
import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Build random calibration data: 256 sequences of 8,192 tokens each.
num_samples = 256
max_seq_len = 8192
tokenizer = AutoTokenizer.from_pretrained(model_id)
max_token_id = len(tokenizer.get_vocab()) - 1
examples = [
    {
        "input_ids": [random.randint(0, max_token_id) for _ in range(max_seq_len)],
        "attention_mask": max_seq_len * [1],
    }
    for _ in range(num_samples)
]

# 8-bit weights, symmetric per-channel scales (group_size=-1), 1% damping.
quantize_config = BaseQuantizeConfig(
    bits=8,
    group_size=-1,
    desc_act=True,
    model_file_base_name="model",
    damp_percent=0.01,
)

# Load the base model, run GPTQ calibration, and save the quantized checkpoint.
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, device_map="auto")
model.quantize(examples)
model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a8")
```
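A small addition, not part of the original snippet but assuming the same output directory, is to save the tokenizer alongside the quantized weights so the folder can be pushed to the Hub as a complete model repository:

```python
# Save the tokenizer next to the quantized weights (same output directory).
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a8")
```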

## Future Work

Further evaluations are planned to compare this quantized model with its unquantized and higher-bit quantized counterparts, especially on benchmarks relevant to code generation and logical reasoning tasks.