Inference speed

by Iker

Hello! Thanks a lot for the quantized models!

In your blog post, you mention that:
"The 1.58bit quantization should fit in 160GB of VRAM for fast inference (2x H100 80GB), with it attaining around 140 tokens per second."

I have tried to follow your instructions:

git clone https://github.com/ggerganov/llama.cpp.git

cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON -DLLAMA_SERVER_SSL=ON

cmake --build llama.cpp/build --config Release -j 16 --clean-first -t llama-quantize llama-server llama-cli llama-gguf-split

cp llama.cpp/build/bin/llama-* llama.cpp
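For completeness, the GGUF shards referenced in the command below have to be downloaded first. A minimal sketch using huggingface_hub (the unsloth/DeepSeek-R1-GGUF repo id and the UD-IQ1_S filename pattern are assumptions on my side, adjust to whichever quant you want):

# Sketch: fetch only the 1.58-bit UD-IQ1_S shards into ./DeepSeek-R1-GGUF
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",   # assumed repo id
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],        # only the 1.58-bit quant shards
)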

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 \
    --temp 0.6 \
    --ctx-size 4096 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"

But I am getting around 10 tokens per second with 2x H100 80GB. Is there something that I am missing?

Unsloth AI org

Hi there, apologies we didn't specify: the 140 tokens per second figure is for total throughput.

For single-user inference, it's around 14-15 tokens per second.

So, to reach the ~140 tok/s throughput, I should increase --ctx-size by a factor of N, run ./llama-server with --parallel N, and keep all N slots saturated simultaneously?
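Something like this sketch, I assume (the --parallel 8, --ctx-size 32768, port 8080, prompt and max_tokens are illustrative; it also assumes llama-server's OpenAI-compatible /v1/completions endpoint returns the usage field):

# Server side (run separately, shown here only as a comment for context):
#   ./llama.cpp/llama-server \
#       --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
#       --cache-type-k q4_0 --threads 12 --n-gpu-layers 62 --prio 2 \
#       --ctx-size 32768 --parallel 8 --port 8080
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/completions"  # llama-server's OpenAI-compatible endpoint
N_SLOTS = 8                                   # must match --parallel on the server

def one_request(_):
    # One generation per slot; completion token count is read from the response usage.
    payload = {
        "prompt": "<|User|>Create a Flappy Bird game in Python.<|Assistant|>",
        "max_tokens": 512,
        "temperature": 0.6,
    }
    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=N_SLOTS) as pool:
    total_tokens = sum(pool.map(one_request, range(N_SLOTS)))
elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} tok/s aggregate")

The aggregate tok/s printed at the end is the throughput number; each individual stream should still only see roughly the single-user 14-15 tok/s.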
