Inference speed
#9
opened by Iker
Hello! Thanks a lot for the quantized models!
In your blog post, you mention that: "The 1.58bit quantization should fit in 160GB of VRAM for fast inference (2x H100 80GB), with it attaining around 140 tokens per second."
I have tried to follow your instructions:
git clone https://github.com/ggerganov/llama.cpp.git
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON -DLLAMA_SERVER_SSL=ON
cmake --build llama.cpp/build --config Release -j 16 --clean-first -t llama-quantize llama-server llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
./llama.cpp/llama-cli \
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 12 -no-cnv --n-gpu-layers 62 --prio 2 \
--temp 0.6 \
--ctx-size 4096 \
--seed 3407 \
--prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
But I am getting around 10 tokens per second with 2x H100 80GB. Is there something that I am missing?
Hi there, apologies we didn't specify: the 140 tokens per second figure is throughput.
For single-user inference, it's around 14-15 tokens per second.
So increase the ctx-size by a factor of N, run ./llama-server --parallel N, and saturate all N slots simultaneously to achieve 140 tok/sec throughput?
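Something like the following? (Assuming roughly N = 10 slots purely as an example, since ~14 tok/s per slot × 10 ≈ 140 tok/s, and scaling --ctx-size to 10 × 4096 = 40960; the other flags are copied from the llama-cli command above, and whether the larger KV cache still fits next to the weights on 2x H100 would need checking.)
# example values only: 10 slots at 4096 context each
./llama.cpp/llama-server \
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 12 --n-gpu-layers 62 --prio 2 \
--temp 0.6 \
--ctx-size 40960 \
--parallel 10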