Llama 3.3 70B Instruct (AutoRound GPTQ 4-bit)

This repository provides a 4-bit quantized version of the Llama 3.3 70B Instruct model using the AutoRound method and GPTQ quantization. This process results in a significantly smaller model footprint with negligible degradation in performance (as measured by MMLU zero-shot evaluations).

Model Description

Base Model: meta-llama/Llama-3.3-70B-Instruct

Quantization: 4-bit GPTQ with AutoRound

Group Size: 128
Symmetry: Enabled (sym=True)

This quantized model aims to preserve the capabilities and accuracy of the original Llama 3.3 70B Instruct model while drastically reducing the model size and computational overhead. By converting weights into a 4-bit representation with carefully selected quantization parameters, the model maintains near-original performance levels on challenging benchmarks.

Performance and Results

MMLU Zero-Shot Performance

  • Original Model (FP16): ~81.82%
  • 4-bit Quantized Model: ~81.93%

As shown above, the 4-bit quantized model achieved an MMLU zero-shot accuracy of 81.93%, which is effectively on par with the original FP16 model鈥檚 81.82%. Thus, the quantization process did not cause performance degradation based on this evaluation metric.

Model Size Reduction

  • Original FP16 Size: ~141.06 GB
  • 4-bit Quantized Size: ~39.77 GB

The quantized model is approximately 3.5x smaller than the original. This reduction significantly lowers storage requirements and can enable faster inference on more modest hardware.

Intended Use

Primary Use Cases:

  • Instruction following and content generation.
  • Conversational AI interfaces, virtual assistants, and chatbots.
  • Research and experimentation on large language models with reduced resource requirements.

Out-of-Scope Use Cases:

  • High-stakes decision-making without human review.
  • Scenarios requiring guaranteed factual correctness (e.g., medical or legal advice).
  • Generation of malicious or harmful content.

Limitations and Biases

Like the original Llama models, this quantized variant may exhibit:

  • Hallucinations: The model can produce factually incorrect or nonsensical outputs.
  • Biases: The model may reflect cultural, social, or other biases present in its training data.

Users should ensure proper oversight and consider the model鈥檚 responses critically. It鈥檚 not suitable for authoritative or mission-critical applications without additional safeguards.

How to Use

You can load the model using transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Satwik11/Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # or torch.bfloat16 if supported
    device_map="auto"
)

prompt = "Explain the concept of gravity to a 10-year-old."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Downloads last month
89
Safetensors
Model size
11.3B params
Tensor type
BF16
I32
FP16
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for Satwik11/Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit

Quantized
(56)
this model