This model has been quantized using GPTQModel with the following configuration:
- bits: 4
- group_size: 128
- desc_act: true
- static_groups: false
- sym: true
- lm_head: false
- damp_percent: 0.01
- true_sequential: true
- model_name_or_path: ""
- model_file_base_name: "model"
- quant_method: "gptq"
- checkpoint_format: "gptq"
- meta:
  - quantizer: "gptqmodel:0.9.9-dev0"
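To give a feel for what `bits: 4`, `group_size: 128`, and `sym: true` mean, here is a minimal sketch of symmetric round-to-nearest group quantization in plain Python. It is not the GPTQ algorithm itself, which additionally compensates for rounding error layer by layer, and it is not GPTQModel's code. The helper names (`quantize_group_symmetric`, `dequantize_group`, `quantize_row`) are hypothetical, and the signed code range is a simplification chosen for readability.

```python
def quantize_group_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group of weights.

    Illustration only: returns signed integer codes plus one shared scale.
    """
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    # One scale per group; fall back to 1.0 for an all-zero group.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(weights, bits=4, group_size=128):
    """Split a row of weights into groups and quantize each independently."""
    return [
        quantize_group_symmetric(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    # Hypothetical toy data: a row of 256 weights gives two groups of 128.
    row = [((i * 37) % 101 - 50) / 50.0 for i in range(256)]
    groups = quantize_row(row, bits=4, group_size=128)
    print(len(groups), "groups, scales:", [scale for _, scale in groups])
```

In this sketch each group of 128 weights shares a single scale, so a smaller `group_size` would track local weight ranges more closely at the cost of storing more scales.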
Here is an example that loads the quantized model and generates text:
import torch
from transformers import AutoTokenizer
from gptqmodel import GPTQModel

# Target GPU for the input tensors.
device = torch.device("cuda:0")
model_name = "ModelCloud/Meta-Llama-3.1-8B-gptq-4bit"
prompt = "I am in Shanghai, preparing to visit the natural history museum. Can you tell me the best way to"

# Load the tokenizer and the quantized checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTQModel.from_quantized(model_name)

# Tokenize the prompt and move the tensors to the GPU.
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Single-beam decoding, generating between 1 and 512 new tokens.
res = model.generate(**inputs, num_beams=1, min_new_tokens=1, max_new_tokens=512)
print(tokenizer.decode(res[0]))