Llama 3.3 70B Instruct (AutoRound GPTQ 4-bit)

This repository provides a 4-bit quantized version of the Llama 3.3 70B Instruct model using the AutoRound method and GPTQ quantization. This process results in a significantly smaller model footprint with negligible degradation in performance (as measured by MMLU zero-shot evaluations).

Model Description

Base Model: meta-llama/Llama-3.3-70B-Instruct

Quantization: 4-bit GPTQ with AutoRound

Group Size: 128
Symmetry: Enabled (sym=True)

This quantized model aims to preserve the capabilities and accuracy of the original Llama 3.3 70B Instruct model while drastically reducing the model size and computational overhead. By converting weights into a 4-bit representation with carefully selected quantization parameters, the model maintains near-original performance levels on challenging benchmarks.

Performance and Results

MMLU Zero-Shot Performance

Original Model (FP16): ~81.82%
4-bit Quantized Model: ~81.93%

As shown above, the 4-bit quantized model achieved an MMLU zero-shot accuracy of 81.93%, which is effectively on par with the original FP16 model’s 81.82%. Thus, the quantization process did not cause performance degradation based on this evaluation metric.

Model Size Reduction

Original FP16 Size: ~141.06 GB
4-bit Quantized Size: ~39.77 GB

The quantized model is approximately 3.5x smaller than the original. This reduction significantly lowers storage requirements and can enable faster inference on more modest hardware.

Intended Use

Primary Use Cases:

Instruction following and content generation.
Conversational AI interfaces, virtual assistants, and chatbots.
Research and experimentation on large language models with reduced resource requirements.

Out-of-Scope Use Cases:

High-stakes decision-making without human review.
Scenarios requiring guaranteed factual correctness (e.g., medical or legal advice).
Generation of malicious or harmful content.

Limitations and Biases

Like the original Llama models, this quantized variant may exhibit:

Hallucinations: The model can produce factually incorrect or nonsensical outputs.
Biases: The model may reflect cultural, social, or other biases present in its training data.

Users should ensure proper oversight and consider the model’s responses critically. It’s not suitable for authoritative or mission-critical applications without additional safeguards.

How to Use

You can load the model using transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Satwik11/Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # or torch.bfloat16 if supported
    device_map="auto"
)

prompt = "Explain the concept of gravity to a 10-year-old."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Satwik11
/

Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit