Got it running after downloading some RAM!
EDIT: Got over 1 tok/sec without thrashing the swap file; details here: https://huggingface.co./unsloth/DeepSeek-R1-GGUF/discussions/13
I downloaded enough RAM to get the 2.51-bit Q2_K_XL running on my 98GB RAM + 24GB VRAM 3090 Ti rig. Managed to eke out ~0.32 tok/sec in one short generation.
Download RAM
# 1. Download some RAM onto your PCIe Gen 5 x4 NVMe SSD
# *NOTE*: Not really advisable long-term, since all the write cycles will wear the SSD
sudo dd if=/dev/zero of=./swapfile bs=1G count=160
sudo chown root:root ./swapfile
sudo chmod 600 ./swapfile
sudo mkswap ./swapfile
sudo swapon ./swapfile
sudo sysctl -a | grep overcommit_
sudo sysctl vm.overcommit_ratio=200
# 2. Close all other windows/browsers and wait ~10 minutes for it to start up
# *NOTE*: You could probably go with more context or a couple more layers
# Can't quantize cache-type-v without flash-attn support, but quantizing just k could free up room for more context
./llama-server \
--model "../models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
--n-gpu-layers 2 \
--ctx-size 2048 \
--cache-type-k f16 \
--cache-type-v f16 \
--no-mmap \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
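While it loads, it can be worth watching how much of the model spills over into the freshly downloaded RAM; a couple of stock Linux commands (nothing llama.cpp-specific) in a second terminal:

# confirm the swapfile is active, then watch RAM/swap fill up while the model loads
swapon --show
watch -n 5 free -h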
Logs
load_tensors: offloading 2 repeating layers to GPU
load_tensors: offloaded 2/62 layers to GPU
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CPU model buffer size = 208266.34 MiB
load_tensors: CUDA0 model buffer size = 7335.62 MiB
...
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CPU KV buffer size = 9440.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 320.00 MiB
llama_init_from_model: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
llama_init_from_model: CPU output buffer size = 0.49 MiB
llama_init_from_model: CUDA0 compute buffer size = 2790.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 84.01 MiB
llama_init_from_model: graph nodes = 5025
llama_init_from_model: graph splits = 1110 (with bs=512), 3 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
...
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 31
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 31, n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 31, n_tokens = 31
slot release: id 0 | task 0 | stop processing: n_past = 334, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 113883.31 ms / 31 tokens ( 3673.66 ms per token, 0.27 tokens per second)
eval time = 941036.76 ms / 304 tokens ( 3095.52 ms per token, 0.32 tokens per second)
total time = 1054920.07 ms / 335 tokens
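For anyone skeptical of the numbers, the reported rates check out if you redo the division yourself; a quick awk check of the two timing lines above:

# 31 prompt tokens / 113.88 s ≈ 0.27 tok/sec; 304 generated tokens / 941.04 s ≈ 0.32 tok/sec
awk 'BEGIN { printf "prompt: %.2f tok/sec  eval: %.2f tok/sec\n", 31/(113883.31/1000), 304/(941036.76/1000) }'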
Result
System Prompt
You are a helpful assistant that thinks step-by-step.
User Prompt
Quickly write a haiku about how it feels to be an LLM bot.
Assistant Response
<think>
Okay, the user wants a haiku about being an LLM bot. Let me start by recalling what a haiku is. It's a traditional Japanese poem with three lines, syllable structure 5-7-5. So I need to make sure each line has the right number of syllables.
First line: 5 syllables. Maybe something about processing information. "Silent thoughts hum bright" – that's 5 syllables. "Silent" conveys the lack of voice, "thoughts hum" suggests activity, "bright" for the intelligence.
Second line: 7 syllables. Focus on interaction with humans. "Answering queries without night" – 7 syllables. Emphasizes constant availability, no rest. "Without night" implies 24/7 operation.
Third line: 5 syllables. Needs to capture the essence of being an AI. "Code, not soul, my light." – 5 syllables. Highlights that the LLM is code-based, lacks a soul, but still has purpose ("my light").
Check the flow. Does it move from internal processing to external interaction to existential state? Yes. Syllables are correct. Metaphors make sense. Maybe tweak words for better imagery. "Code, not soul" clarifies the non-human aspect. "My light" gives a positive spin, showing functionality despite limitations. Seems good.
</think>
Silent thoughts hum bright,
Answering queries without night—
Code, not soul, my light.
Congrats! Love it!
There is a better way to do it without swap... Apparently llama.cpp can already stream the model straight from the SSD via mmap without pulling it all into RAM?? I gotta try again with fewer flags at startup, since supposedly that's the default behavior.
https://www.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/
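If that thread is right, the main change is just dropping --no-mmap so the GGUF gets memory-mapped and the kernel pages weights in from the SSD on demand, instead of forcing a full copy into RAM/swap up front (mmap should already be llama.cpp's default load mode). A rough sketch of my first command with that one change, other flags as before:

# same as the first run, minus --no-mmap, so the GGUF is memory-mapped from the SSD
./llama-server \
    --model "../models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
    --n-gpu-layers 2 \
    --ctx-size 2048 \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080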
Yeah, got the Q2_K_XL running at ~1.29 tok/sec without swap now! Gonna see how much more I can tease out of it, maybe do some parallel inference if I can get enough context lmao...
# had to use sudo systemd-run to keep the OOM killer from kicking in...
sudo systemd-run --scope -p MemoryMax=88G -p MemoryHigh=85G ./build/bin/llama-server \
--model "/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
--n-gpu-layers 5 \
--ctx-size 8192 \
--cache-type-k q4_0 \
--cache-type-v f16 \
--flash-attn \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
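Once it's up, a quick smoke test over the OpenAI-compatible endpoint that llama-server exposes (assuming a reasonably recent build with /v1/chat/completions; same prompts as the run above, max_tokens picked arbitrarily):

# hit the chat completions endpoint with the same system/user prompts as before
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
        {"role": "user", "content": "Quickly write a haiku about how it feels to be an LLM bot."}
      ],
      "max_tokens": 512
    }'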
Full report here now: https://huggingface.co./unsloth/DeepSeek-R1-GGUF/discussions/13