Original model: https://huggingface.co./nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF

Prompt Template

### System:
{system_prompt}
### User:
{user_prompt}
### Assistant:

Important for people who want to do their own quantization: there is a typo in tokenizer_config.json of the original model that mistakenly sets eos_token to '<|eot_id|>' when it should be '<|end_of_text|>'. Please fix it, or overwrite it with the tokenizer_config.json in this repository, before you do the gguf conversion yourself.
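If you prefer to patch the file in place, here is a minimal sketch (the local directory name Llama-3_1-Nemotron-51B-Instruct/ is an assumption; adjust the path, and check the file first in case eos_token is stored as an object rather than a plain string):

# rewrite eos_token in tokenizer_config.json to <|end_of_text|>
python3 -c "import json; p='Llama-3_1-Nemotron-51B-Instruct/tokenizer_config.json'; d=json.load(open(p)); d['eos_token']='<|end_of_text|>'; json.dump(d, open(p,'w'), indent=2, ensure_ascii=False)"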

Starting from b4380 of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is supported. Please download and compile it (b4380 or later) to run the GGUFs in this repository.
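If you build from source, a typical CMake build looks like this (any tag from b4380 onward should work):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b4380
cmake -B build
cmake --build build --config Release
# binaries such as llama-cli, llama-quantize and llama-perplexity end up in build/bin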

This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear FFN layers. Support for these can be added once there are models that actually use those types of layers.

Since I am a free user, for the time being I only upload the quants that are likely to be of interest to most people.

Download a file (not the whole branch) from below:

Perplexity for the f16 gguf is 6.646565 ± 0.040986.

Quant Type | imatrix | File Size | Delta Perplexity | KL Divergence | Description
--- | --- | --- | --- | --- | ---
Q6_K | calibration_datav3 | 42.26GB | -0.002436 ± 0.001565 | 0.003332 ± 0.000014 | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original.
Q5_K_M | calibration_datav3 | 36.47GB | 0.020310 ± 0.002052 | 0.005642 ± 0.000024 | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower.
Q4_K_M | calibration_datav3 | 31.04GB | 0.055444 ± 0.002982 | 0.012021 ± 0.000052 | Good for A100 40GB or dual 3090. Higher cost-performance ratio than Q5_K_M.
IQ4_NL | calibration_datav3 | 29.30GB | 0.088279 ± 0.003944 | 0.020314 ± 0.000093 | For 32GB cards, e.g. 5090. The minor performance gain doesn't justify its use over IQ4_XS.
IQ4_XS | calibration_datav3 | 27.74GB | 0.095486 ± 0.004039 | 0.020962 ± 0.000097 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. Recommended.
Q4_0 | calibration_datav3 | 29.34GB | 0.543042 ± 0.009290 | 0.077602 ± 0.000389 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple.
Q4_0_4_8 | calibration_datav3 | 29.25GB | Same as Q4_0 (assumed) | Same as Q4_0 (assumed) | For Apple Silicon.
IQ3_M | calibration_datav3 | 23.5GB | 0.313812 ± 0.006299 | 0.054266 ± 0.000205 | Largest model that can fit a single 3090 at 4k context. Not recommended for CPU or Apple Silicon due to high computational cost.
IQ3_S | calibration_datav3 | 22.7GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit a single 3090 at 8k context. Not recommended for CPU or Apple Silicon due to high computational cost.
Q3_K_S | calibration_datav3 | 22.7GB | 0.698971 ± 0.010387 | 0.089605 ± 0.000443 | Largest model that can fit a single 3090 and performs well on all platforms.
Q3_K_S | none | 22.7GB | 2.224537 ± 0.024868 | 0.283028 ± 0.001220 | Largest model that can fit a single 3090, quantized without an imatrix.

How to check i8mm support for Apple devices

ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures from ARMv8.6-A onward support i8mm, which means Apple Silicon from the A15 and M2 onward works best with Q4_0_4_8.

For Apple devices,

sysctl hw
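To check the flag directly, something like the following should work (the exact sysctl key name, hw.optional.arm.FEAT_I8MM on recent macOS, may vary between OS versions):

# prints hw.optional.arm.FEAT_I8MM: 1 when i8mm is available
sysctl hw | grep -i i8mm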

On the other hand, Nvidia 3090 inference is significantly faster with Q4_0 than with the other ggufs. That means for GPU inference, you are better off using Q4_0.

Which Q4_0 model to use for Apple devices

Brand | Series | Model | i8mm | sve | Quant Type
--- | --- | --- | --- | --- | ---
Apple | A | A4 to A14 | No | No | Q4_0_4_4
Apple | A | A15 to A18 | Yes | No | Q4_0_4_8
Apple | M | M1 | No | No | Q4_0_4_4
Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8

Convert safetensors to f16 gguf

Make sure you have llama.cpp git cloned.
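The Python dependencies of convert_hf_to_gguf.py are listed in requirements.txt of the llama.cpp repository; a sketch, assuming you run this from inside the llama.cpp checkout:

# install the packages needed by the conversion script
pip install -r requirements.txt

Then run the conversion: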

python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16

Convert f16 gguf to Q4_0 gguf without imatrix

Make sure you have llama.cpp compiled:

./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0

Convert f16 gguf to Q4_0 gguf with imatrix

Make sure you have llama.cpp compiled. Then create an imatrix with a dataset.

./llama-imatrix -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f calibration_datav3.txt -o Llama-3_1-Nemotron-51B-Instruct.imatrix --chunks 32

Then convert with the created imatrix.

./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf --imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0.gguf q4_0

Calculate perplexity and KL divergence

First, download wikitext.

bash ./scripts/get-wikitext-2.sh

Second, generate the base values from the f16 gguf. Please be warned that the generated base value file is about 10GB. Adjust the number of GPU layers depending on your VRAM.

./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf  -f wikitext-2-raw/wiki.test.raw -ngl 100

Finally, calculate the perplexity and KL divergence of Q4_0 gguf. Adjust GPU layers depending on your VRAM.

./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld --kl-divergence -m Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf -ngl 100 >& Llama-3_1-Nemotron-51B-Instruct.Q4_0.kld

Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./

Running the model using llama-cli

First, go to the llama.cpp release page and download the appropriate pre-compiled release, b4380 or later. If that doesn't work, download the source of any release from b4380 onward, compile it yourself, then run

./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.'  -cnv -ngl 100

Credits

Thank you bartowski for providing a README.md to get me started.
