Got it running after downloading some RAM!
EDIT: Got over 1 tok/sec without thrashing the swap file; details here: https://huggingface.co./unsloth/DeepSeek-R1-GGUF/discussions/13
I downloaded enough RAM to get the 2.51-bit Q2_K_XL running on my 98GB RAM + 24GB VRAM 3090 Ti rig. Managed to eke out ~0.32 tok/sec in one short generation.
Download RAM
# 1. Download some RAM onto your PCIe Gen 5 x4 NVMe SSD
# *NOTE*: Not really advisable long-term, since all the write cycles will wear the SSD
sudo dd if=/dev/zero of=./swapfile bs=1G count=160
sudo chown root:root ./swapfile
sudo chmod 600 ./swapfile
sudo mkswap ./swapfile
sudo swapon ./swapfile
sudo sysctl -a | grep overcommit_
sudo sysctl vm.overcommit_ratio=200
# 2. Close all other windows/browsers and wait ~10 minutes for it to start up
# *NOTE*: You could probably go with more context or a couple more layers
# Can't quantize cache-type-v without flash-attn support, but quantizing just k could free up room for more context
./llama-server \
--model "../models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
--n-gpu-layers 2 \
--ctx-size 2048 \
--cache-type-k f16 \
--cache-type-v f16 \
--no-mmap \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
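While it loads, it can be worth watching how much of the model spills over into the freshly downloaded RAM; a couple of stock Linux commands (nothing llama.cpp-specific) in a second terminal:

# confirm the swapfile is active, then watch RAM/swap fill up while the model loads
swapon --show
watch -n 5 free -h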
Logs
load_tensors: offloading 2 repeating layers to GPU
load_tensors: offloaded 2/62 layers to GPU
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CPU model buffer size = 208266.34 MiB
load_tensors: CUDA0 model buffer size = 7335.62 MiB
...
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CPU KV buffer size = 9440.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 320.00 MiB
llama_init_from_model: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
llama_init_from_model: CPU output buffer size = 0.49 MiB
llama_init_from_model: CUDA0 compute buffer size = 2790.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 84.01 MiB
llama_init_from_model: graph nodes = 5025
llama_init_from_model: graph splits = 1110 (with bs=512), 3 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
...
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 31
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 31, n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 31, n_tokens = 31
slot release: id 0 | task 0 | stop processing: n_past = 334, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 113883.31 ms / 31 tokens ( 3673.66 ms per token, 0.27 tokens per second)
eval time = 941036.76 ms / 304 tokens ( 3095.52 ms per token, 0.32 tokens per second)
total time = 1054920.07 ms / 335 tokens
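For anyone skeptical of the numbers, the reported rates check out if you redo the division yourself; a quick awk check of the two timing lines above:

# 31 prompt tokens / 113.88 s ≈ 0.27 tok/sec; 304 generated tokens / 941.04 s ≈ 0.32 tok/sec
awk 'BEGIN { printf "prompt: %.2f tok/sec  eval: %.2f tok/sec\n", 31/(113883.31/1000), 304/(941036.76/1000) }'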
Result
System Prompt
You are a helpful assistant that thinks step-by-step.
User Prompt
Quickly write a haiku about how it feels to be an LLM bot.
Assistant Response
<think>
Okay, the user wants a haiku about being an LLM bot. Let me start by recalling what a haiku is. It's a traditional Japanese poem with three lines, syllable structure 5-7-5. So I need to make sure each line has the right number of syllables.
First line: 5 syllables. Maybe something about processing information. "Silent thoughts hum bright" – that's 5 syllables. "Silent" conveys the lack of voice, "thoughts hum" suggests activity, "bright" for the intelligence.
Second line: 7 syllables. Focus on interaction with humans. "Answering queries without night" – 7 syllables. Emphasizes constant availability, no rest. "Without night" implies 24/7 operation.
Third line: 5 syllables. Needs to capture the essence of being an AI. "Code, not soul, my light." – 5 syllables. Highlights that the LLM is code-based, lacks a soul, but still has purpose ("my light").
Check the flow. Does it move from internal processing to external interaction to existential state? Yes. Syllables are correct. Metaphors make sense. Maybe tweak words for better imagery. "Code, not soul" clarifies the non-human aspect. "My light" gives a positive spin, showing functionality despite limitations. Seems good.
</think>
Silent thoughts hum bright,
Answering queries without night—
Code, not soul, my light.
Congrats! Love it!
There is a better way to do it without swap... Apparently llama.cpp can already stream the model straight from the SSD via mmap without pulling it all into RAM?? I gotta try again with fewer flags at startup, since supposedly that's the default behavior.
https://www.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/
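If that thread is right, the main change is just dropping --no-mmap so the GGUF gets memory-mapped and the kernel pages weights in from the SSD on demand, instead of forcing a full copy into RAM/swap up front (mmap should already be llama.cpp's default load mode). A rough sketch of my first command with that one change, other flags as before:

# same as the first run, minus --no-mmap, so the GGUF is memory-mapped from the SSD
./llama-server \
    --model "../models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
    --n-gpu-layers 2 \
    --ctx-size 2048 \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080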
Yeah, got the Q2_K_XL running at ~1.29 tok/sec without swap now! Gonna see how much more I can tease out of it, maybe do some parallel inference if I can get enough context lmao...
# had to use sudo systemd-run to keep the OOM killer from kicking in...
sudo systemd-run --scope -p MemoryMax=88G -p MemoryHigh=85G ./build/bin/llama-server \
--model "/mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf" \
--n-gpu-layers 5 \
--ctx-size 8192 \
--cache-type-k q4_0 \
--cache-type-v f16 \
--flash-attn \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
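Once it's up, a quick smoke test over the OpenAI-compatible endpoint that llama-server exposes (assuming a reasonably recent build with /v1/chat/completions; same prompts as the run above, max_tokens picked arbitrarily):

# hit the chat completions endpoint with the same system/user prompts as before
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
        {"role": "user", "content": "Quickly write a haiku about how it feels to be an LLM bot."}
      ],
      "max_tokens": 512
    }'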
Full report here now: https://huggingface.co./unsloth/DeepSeek-R1-GGUF/discussions/13