---
base_model: nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF
library_name: transformers
language:
  - en
tags:
  - nvidia
  - llama-3
  - pytorch
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
quantized_by: ymcki
---

Original model: https://huggingface.co./nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF

## Prompt Template

```
### System:
{system_prompt}

### User:
{user_prompt}

### Assistant:
```
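If you call the model from your own code (for example through llama-server or llama-cpp-python), you have to assemble this template yourself. Below is a minimal Python sketch; the helper name and the default system prompt are illustrative and not part of the original model card.

```
def build_prompt(user_prompt, system_prompt="You are a helpful assistant."):
    # Assemble the Nemotron-style template shown above.
    return (
        f"### System:\n{system_prompt}\n\n"
        f"### User:\n{user_prompt}\n\n"
        f"### Assistant:\n"
    )

print(build_prompt("Who was Otto von Bismarck?",
                   "You are a European History Professor named Professor Whitman."))
```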
***Important*** for people who want to do their own quantization. The convert_hf_to_gguf.py in b4380 of llama.cpp doesn't read the rope_theta parameter, so the ggufs it generates cannot handle prompts longer than 4k tokens. There is currently a [PR](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/convert_hf_to_gguf.py) in llama.cpp to update convert_hf_to_gguf.py. If you can't wait for the PR to go through, you can download a working convert_hf_to_gguf.py from [here](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/convert_hf_to_gguf.py) in this repository before you do the gguf conversion yourself.

As of [b4380](https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz) of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is supported. Please download and compile it to run the GGUFs in this repository.

This modification should support Llama-3_1-Nemotron-51B-Instruct fully. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. Well, I suppose such support can be added when there are actually models using those types of layers.

Since I am a free user, for the time being I only upload models that might be of interest to most people.

## Download a file (not the whole branch) from below:

Perplexity for the f16 gguf is 6.646565 ± 0.040986.

| Quant Type | imatrix | File Size | Delta Perplexity | KL Divergence | Description |
| ---------- | ------- | --------- | ---------------- | ------------- | ----------- |
| [Q6_K](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q6_K.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 42.26GB | -0.002436 ± 0.001565 | 0.003332 ± 0.000014 | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original. |
| [Q5_K_M](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q5_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 36.47GB | 0.020310 ± 0.002052 | 0.005642 ± 0.000024 | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower. |
| [Q4_K_M](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 31.04GB | 0.055444 ± 0.002982 | 0.012021 ± 0.000052 | Good for A100 40GB or dual 3090. Higher cost-performance ratio than Q5_K_M. |
| IQ4_NL | calibration_datav3 | 29.30GB | 0.088279 ± 0.003944 | 0.020314 ± 0.000093 | For 32GB cards, e.g. 5090. The minor performance gain doesn't justify its use over IQ4_XS. |
| [IQ4_XS](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ4_XS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 27.74GB | 0.095486 ± 0.004039 | 0.020962 ± 0.000097 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. Recommended. |
| Q4_0 | calibration_datav3 | 29.34GB | 0.543042 ± 0.009290 | 0.077602 ± 0.000389 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. |
| [IQ3_M](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 23.49GB | 0.313812 ± 0.006299 | 0.054266 ± 0.000205 | Largest model that can fit in a single 3090 at 5k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| [IQ3_S](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_S.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 22.65GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit in a single 3090 at 7k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| [IQ3_XXS](https://huggingface.co./ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_XXS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 20.19GB | 0.638630 ± 0.009693 | 0.092827 ± 0.000367 | Largest model that can fit in a single 3090 at 13k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| Q3_K_S | calibration_datav3 | 22.65GB | 0.698971 ± 0.010387 | 0.089605 ± 0.000443 | Largest model that can fit in a single 3090 that performs well on all platforms. |
| Q3_K_S | none | 22.65GB | 2.224537 ± 0.024868 | 0.283028 ± 0.001220 | Largest model that can fit in a single 3090, without imatrix. |

## Convert safetensors to f16 gguf

Make sure you have llama.cpp git cloned:

```
python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16
```

## Convert f16 gguf to Q4_0 gguf without imatrix

Make sure you have llama.cpp compiled:

```
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
```

## Convert f16 gguf to Q4_0 gguf with imatrix

Make sure you have llama.cpp compiled. Then create an imatrix with a calibration dataset:

```
./llama-imatrix -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f calibration_datav3.txt -o Llama-3_1-Nemotron-51B-Instruct.imatrix --chunks 32
```

Then quantize with the created imatrix:

```
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf --imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0.gguf q4_0
```

## Calculate perplexity and KL divergence

First, download wikitext:

```
bash ./scripts/get-wikitext-2.sh
```

Second, compute the base values of the f16 gguf. Please be warned that the generated base value file is about 10GB. Adjust GPU layers depending on your VRAM.

```
./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 100
```

Finally, calculate the perplexity and KL divergence of the Q4_0 gguf. Adjust GPU layers depending on your VRAM.

```
./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld --kl-divergence -m Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf -ngl 100 >& Llama-3_1-Nemotron-51B-Instruct.Q4_0.kld
```

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./
```
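Alternatively, if you prefer Python over the CLI, the same file can be fetched with the huggingface_hub API. A minimal sketch; the filename is just the Q4_0 example from above, so substitute the quant you actually want.

```
from huggingface_hub import hf_hub_download

# Fetch a single GGUF file from this repository into the current directory.
path = hf_hub_download(
    repo_id="ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF",
    filename="Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf",
    local_dir=".",
)
print("Downloaded to", path)
```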
## Running the model using llama-cli

First, go to the llama.cpp [release page](https://github.com/ggerganov/llama.cpp/releases) and download an appropriate pre-compiled release, b4380 or later. If that doesn't work, download the source of any llama.cpp version starting from [b4380](https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz). Compile it, then run

```
./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.' -cnv -ngl 100
```

A minimal llama-cpp-python alternative is sketched at the end of this README.

## Credits

Thank you bartowski for providing a README.md to get me started.
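For completeness, here is a minimal llama-cpp-python sketch for running one of the GGUFs above. This assumes your llama-cpp-python build is based on llama.cpp b4380 or later (earlier builds will not load this architecture); the file path and generation settings are illustrative.

```
from llama_cpp import Llama

# Load the quantized model; n_gpu_layers=-1 offloads every layer to the GPU.
llm = Llama(
    model_path="Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
)

# Use the prompt template from the top of this README.
prompt = (
    "### System:\nYou are a European History Professor named Professor Whitman.\n\n"
    "### User:\nSummarize the causes of the Thirty Years' War.\n\n"
    "### Assistant:\n"
)

out = llm(prompt, max_tokens=256, temperature=0.7, stop=["### User:"])
print(out["choices"][0]["text"])
```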