llama.cpp is 26.8% faster than ollama. I have upgraded both, and using the same settings, I am running the same DeepSeek R1 Distill 1.5B on the same hardware. It's an apples-to-apples comparison.

Total duration:
llama.cpp 6.85 sec <- 26.8% faster
ollama 8.69 sec

Breakdown by phase:

Model loading
llama.cpp 241 ms <- 2x faster
ollama 553 ms

Prompt processing
llama.cpp 416.04 tokens/s with an eval time of 45.67 ms <- 10x faster
ollama 42.17 tokens/s with an eval time of 498 ms

Token generation
llama.cpp 137.79 tokens/s with an eval time of 6.62 sec <- 13% faster
ollama 122.07 tokens/s with an eval time of 7.64 sec

llama.cpp is LLM inference in C/C++; ollama adds abstraction layers and marketing. Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
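The headline percentages follow directly from the timings above. Here is a minimal sketch of the arithmetic in Python, assuming "X% faster" means the relative difference in duration or throughput (the post does not spell this out, and tiny discrepancies come from rounding of the displayed numbers):

```python
# Speedup arithmetic using only the timings reported in the post.
# "X% faster" is read here as the relative difference; small deviations from
# the post's figures are due to rounding of the displayed values.

llama_total_s, ollama_total_s = 6.85, 8.69
print(f"total duration:    {(ollama_total_s / llama_total_s - 1) * 100:.1f}% faster")  # ~26.9%

llama_load_ms, ollama_load_ms = 241, 553
print(f"model loading:     {ollama_load_ms / llama_load_ms:.1f}x faster")              # ~2.3x

llama_pp_tps, ollama_pp_tps = 416.04, 42.17
print(f"prompt processing: {llama_pp_tps / ollama_pp_tps:.1f}x faster")                # ~9.9x

llama_gen_tps, ollama_gen_tps = 137.79, 122.07
print(f"token generation:  {(llama_gen_tps / ollama_gen_tps - 1) * 100:.1f}% faster")  # ~12.9%
```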
GGUF LoRA adapters Collection Adapters extracted from fine-tuned models, using mergekit-extract-lora • 16 items • Updated 4 days ago • 4
Extracted LoRA (mergekit) Collection PEFT-compatible LoRA adapters produced by mergekit-extract-lora • 17 items • Updated 4 days ago • 3
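Since the second collection advertises PEFT compatibility, here is a minimal sketch of how such an extracted adapter would typically be attached to its base model with the PEFT library. The model and adapter IDs are placeholders, not items from the collections above:

```python
# Sketch: loading a PEFT-compatible LoRA adapter (e.g. one produced by
# mergekit-extract-lora) onto its base model. Repo IDs below are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "base-org/base-model"          # hypothetical base model repo
adapter_id = "your-org/extracted-lora"   # hypothetical extracted-LoRA adapter repo

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the LoRA weights to the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)

# Optionally fold the adapter into the base weights for plain inference.
model = model.merge_and_unload()
```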