Model Card for llama-13b-hf-35q_4bit-128g_WVU

Model Description

llama-13b-hf-35q_4bit-128g_WVU is a 13-billion-parameter model based on the Llama architecture. The first 35 decoder layers are quantized with the GPTQ method at 4-bit precision with a group size of 128. The last 5 decoder layers (1/8 of the 40 decoder layers) and lm_head were then fine-tuned on the wizard_vicuna_70k_unfiltered dataset for 1 epoch.
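To make "4-bit precision with a group size of 128" concrete, here is a minimal sketch of group-wise quantization. It uses plain round-to-nearest, not GPTQ itself (GPTQ additionally adjusts the remaining weights to compensate for rounding error), and the function names and example row are hypothetical. It only shows the storage idea: every 128 consecutive weights share one scale and one offset.

```python
import math
from typing import List, Tuple

BITS = 4                 # 4-bit codes, as in the model name
GROUP_SIZE = 128         # weights per group (the "128g" in the model name)
LEVELS = 2 ** BITS - 1   # largest code value: 15


def quantize_group(weights: List[float]) -> Tuple[List[int], float, float]:
    """Round-to-nearest quantization of one group; returns (codes, scale, minimum)."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / LEVELS
    if scale == 0.0:  # constant group: any positive scale works
        scale = 1.0
    # Clamp to the valid code range to guard against floating-point overshoot.
    codes = [min(LEVELS, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


def quantize_row(row: List[float]) -> List[Tuple[List[int], float, float]]:
    """Split a weight row into groups of GROUP_SIZE and quantize each separately."""
    return [quantize_group(row[i:i + GROUP_SIZE]) for i in range(0, len(row), GROUP_SIZE)]


# Hypothetical example row: 256 weights, so two groups of 128.
row = [math.sin(i) * 0.05 for i in range(256)]
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups={len(groups)}, max reconstruction error={max_error:.6f}")
```

The reconstruction error per weight is bounded by half the group's scale, which is why smaller groups (with their own scale) tend to reduce error at the cost of extra metadata.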

Note

Quantization reduces memory usage effectively, but it slightly changes the parameter values. Fine-tuning only the last few layers lowers the memory required for training, but it could lead to minor performance degradation.
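As a rough illustration of the memory trade-off, the sketch below counts parameters and estimates weight storage. The dimensions are the standard Llama 13B configuration (hidden size 5120, MLP size 13824, 40 layers, vocabulary 32000), which this card does not state, and the per-group overhead (one fp16 scale plus one 4-bit zero point per 128 weights) is an assumption about the storage format. Activations, gradients, and optimizer state are not included.

```python
from typing import Tuple

# Assumed Llama 13B configuration (not stated in this card).
HIDDEN = 5120
INTERMEDIATE = 13824
NUM_LAYERS = 40
VOCAB = 32000

# Split described in this card.
QUANT_LAYERS = 35
TUNED_LAYERS = NUM_LAYERS - QUANT_LAYERS  # 5
BITS = 4
GROUP_SIZE = 128


def decoder_layer_params() -> Tuple[int, int]:
    """Return (linear weights, norm weights) for one decoder layer."""
    attention = 4 * HIDDEN * HIDDEN      # q, k, v, o projections
    mlp = 3 * HIDDEN * INTERMEDIATE      # gate, up, down projections
    norms = 2 * HIDDEN                   # two RMSNorm weight vectors
    return attention + mlp, norms


linear, norms = decoder_layer_params()
embed = VOCAB * HIDDEN                   # token embedding
lm_head = VOCAB * HIDDEN                 # output projection
final_norm = HIDDEN

total_params = NUM_LAYERS * (linear + norms) + embed + lm_head + final_norm
trainable_params = TUNED_LAYERS * (linear + norms) + lm_head

# Assumed overhead: one fp16 scale and one 4-bit zero point per group.
bits_per_quant_weight = BITS + (16 + BITS) / GROUP_SIZE

quant_layer_bytes = linear * bits_per_quant_weight / 8 + norms * 2
fp16_layer_bytes = (linear + norms) * 2
other_bytes = (embed + lm_head + final_norm) * 2

mixed_bytes = QUANT_LAYERS * quant_layer_bytes + TUNED_LAYERS * fp16_layer_bytes + other_bytes
fp16_bytes = total_params * 2

print(f"total parameters:     {total_params:,}")
print(f"trainable parameters: {trainable_params:,} ({trainable_params / total_params:.1%})")
print(f"fp16 weights:         {fp16_bytes / 2**30:.1f} GiB")
print(f"mixed 4-bit/fp16:     {mixed_bytes / 2**30:.1f} GiB")
```

Under these assumptions, the model has about 13.0 billion parameters, of which roughly 13% (the last 5 decoder layers plus lm_head) are trainable.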

There are several ways to fine-tune and quantize Llama models. The method used here (quantizing most layers, then fine-tuning the last few) is designed to compensate for errors introduced during quantization, which can sometimes result in unexpected answers. The last few layers are therefore fine-tuned with both the quantization error and the dataset taken into account.
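The split between frozen and fine-tuned parameters can be sketched as a simple name filter. This is an illustration, not the training code used for this model. It assumes parameter names follow the Hugging Face Llama convention (`model.layers.<i>....`, `lm_head.weight`), which may differ in the custom quantized LlamaForCausalLM, and `is_trainable` is a hypothetical helper. Whether the final norm was also tuned is not stated in this card; the sketch leaves it frozen.

```python
import re

NUM_LAYERS = 40     # decoder layers in Llama 13B (35 quantized + 5 fine-tuned)
TUNED_LAYERS = 5    # last 5 decoder layers are fine-tuned

# Matches names such as "model.layers.37.mlp.up_proj.weight" (assumed naming).
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def is_trainable(param_name: str) -> bool:
    """Return True for lm_head and for the last TUNED_LAYERS decoder layers."""
    if param_name.startswith("lm_head."):
        return True
    match = _LAYER_RE.match(param_name)
    if match is None:
        return False  # embeddings, final norm, anything else stays frozen
    return int(match.group(1)) >= NUM_LAYERS - TUNED_LAYERS


# Hypothetical parameter names for demonstration.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.34.mlp.down_proj.weight",
    "model.layers.35.self_attn.q_proj.weight",
    "model.layers.39.mlp.up_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
trainable = [n for n in names if is_trainable(n)]
print(trainable)
```

With PyTorch, a filter like this would typically be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name)`.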

Other methods may yield better performance. For instance, the following steps, applied in order:

1. Fine-tuning the entire model for X epochs
2. Quantizing the first K layers
3. Fine-tuning the remaining layers for Y epochs

Nonetheless, fine-tuning the entire model requires considerable resources (for example, 4 GPUs with 80GB of VRAM each are required for LLaMA 7B), so this model omits the first step of the method described above, and it still works.

Using the Model

To load the model, a custom LlamaForCausalLM is required. You can find quantized llama here.

References

Meta - LLaMA
WizardLM
GPTQ for LLaMa
Wizard Vicuna Unfiltered Dataset
Various other great works, research, and projects not listed here.