
Model Card for llama-13b-hf-35q_4bit-128g_WVU

Model Description

llama-13b-hf-35q_4bit-128g_WVU is a 13-billion-parameter model based on the Llama architecture. The first 35 decoder layers are quantized with the GPTQ method at 4-bit precision with a group size of 128. The last 5 decoder layers (1/8 of the 40 decoder layers) and lm_head are then fine-tuned on the wizard_vicuna_70k_unfiltered dataset for 1 epoch.
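
To make "4-bit precision with a group size of 128" concrete, the following is a minimal sketch of group-wise quantization. It uses plain round-to-nearest, which is a simpler stand-in and not the GPTQ algorithm itself (GPTQ additionally adjusts the remaining weights to compensate for rounding error). The function names and the asymmetric min/max scheme are assumptions made for illustration only.

```python
def quantize_group(values, bits=4):
    """Round-to-nearest quantization of one group (illustrative, not GPTQ)."""
    levels = (1 << bits) - 1                      # 15 for 4-bit codes 0..15
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each value to the nearest integer code and clamp to the valid range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def quantize_row(row, group_size=128, bits=4):
    """Quantize a weight row in chunks; each chunk has its own scale and offset."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Rebuild approximate float weights from the per-group codes."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(c * scale + lo for c in codes)
    return restored
```

Because every group of 128 weights carries its own scale and offset, an outlier only affects the resolution of its own group, at the cost of storing a little extra metadata per group.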

Note

Quantization reduces memory usage effectively, but it changes the parameter values relative to the original model. Likewise, fine-tuning only the last few layers lowers the memory required for training, but it could lead to minor performance degradation.
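
As a rough back-of-the-envelope illustration of the memory saving, the sketch below estimates weight storage only. The layer dimensions are the publicly known LLaMA-13B configuration values and are not stated in this card; the per-group metadata layout (one fp16 scale and one 4-bit zero point per group) and the assumption that unquantized weights are kept in fp16 are also illustrative guesses. Normalization weights, activations, and optimizer state are ignored.

```python
# Assumed LLaMA-13B shape (not stated in this card).
HIDDEN = 5120
INTERMEDIATE = 13824
NUM_LAYERS = 40
VOCAB = 32000


def linear_params_per_layer(hidden=HIDDEN, intermediate=INTERMEDIATE):
    """Weights in the linear projections of one decoder layer."""
    attention = 4 * hidden * hidden        # q, k, v, o projections
    mlp = 3 * hidden * intermediate        # gate, up, down projections
    return attention + mlp


def weight_bytes(quantized_layers, num_layers=NUM_LAYERS, bits=4, group_size=128):
    """Approximate weight storage in bytes for a partially quantized model."""
    per_layer = linear_params_per_layer()
    # Codes plus per-group metadata (assumed: fp16 scale + one zero point).
    bits_per_weight = bits + (16 + bits) / group_size
    quantized = quantized_layers * per_layer * bits_per_weight / 8
    fp16_layers = (num_layers - quantized_layers) * per_layer * 2
    embeddings_and_head = 2 * VOCAB * HIDDEN * 2   # embed_tokens + lm_head in fp16
    return quantized + fp16_layers + embeddings_and_head


if __name__ == "__main__":
    gib = 1024 ** 3
    print(f"all fp16:            {weight_bytes(0) / gib:.2f} GiB")
    print(f"35 layers at 4-bit:  {weight_bytes(35) / gib:.2f} GiB")
```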

Several alternatives exist for fine-tuning and quantizing the Llama models. The method used here, quantizing most of the layers and then fine-tuning the last few, is designed to account for errors introduced during quantization (which can sometimes result in unexpected answers). Because the last few layers are fine-tuned with the quantized layers already in place, they can adapt to both the quantization error and the dataset.
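
The selection of which parameters to train can be sketched as a simple name filter. This is an illustration under stated assumptions, not the training code used for this model: it assumes Hugging Face style parameter names (`model.layers.<i>.…`, `lm_head.…`), 40 decoder layers, and that the embeddings and final norm stay frozen, which this card does not specify. `is_trainable` is a hypothetical helper name.

```python
import re

_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def is_trainable(param_name, num_layers=40, num_tuned_layers=5):
    """Return True for parameters in the last decoder layers and in lm_head."""
    if param_name.startswith("lm_head."):
        return True
    match = _LAYER_RE.match(param_name)
    if match is None:
        # Embeddings and final norm: kept frozen in this sketch (assumption).
        return False
    # Layers are indexed from 0, so the last 5 of 40 are indices 35..39.
    return int(match.group(1)) >= num_layers - num_tuned_layers
```

With a PyTorch module, such a filter would typically be applied as `for name, p in model.named_parameters(): p.requires_grad = is_trainable(name)`, so that gradients and optimizer state are only kept for the fine-tuned layers.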

It is worth mentioning that other methods may yield superior performance. For instance:

  1. Fine-tuning the entire model for X epochs
  2. Quantizing the first K layers
  3. Fine-tuning the remaining layers for Y epochs

Nonetheless, fine-tuning the entire model requires considerable resources (for example, 4 GPUs with 80GB of VRAM are required for 7B LLaMA), so this model omits the first step of the method described above, and it still works.

Using the Model

To load the model, a custom LlamaForCausalLM is required. You can find quantized llama here.

References

  1. Meta - LLaMA
  2. WizardLM
  3. GPTQ for LLaMa
  4. Wizard Vicuna Unfiltered Dataset
  5. Various unlisted but great works, research, and projects.