ngxson's activity
I have upgraded both, and using the same settings, I am running the same DeepSeek R1 Distill 1.5B model on the same hardware. It's an apples-to-apples comparison.
Total duration:
- llama.cpp: 6.85 sec <- 26.8% faster
- ollama: 8.69 sec

Breakdown by phase:

Model loading:
- llama.cpp: 241 ms <- 2x faster
- ollama: 553 ms

Prompt processing:
- llama.cpp: 416.04 tokens/s, eval time 45.67 ms <- 10x faster
- ollama: 42.17 tokens/s, eval time 498 ms

Token generation:
- llama.cpp: 137.79 tokens/s, eval time 6.62 sec <- 13% faster
- ollama: 122.07 tokens/s, eval time 7.64 sec
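If you want to reproduce this kind of comparison yourself, both servers expose an OpenAI-compatible /v1/chat/completions endpoint, so a small script can time the same request against each. This is only a rough sketch: the ports, the model name, and the wall-clock tokens/s calculation are assumptions about a local setup, not the exact methodology behind the numbers above.

```python
import time
import requests

# Hypothetical local endpoints: llama-server on its default port 8080 and
# ollama's OpenAI-compatible API on its default port 11434.
ENDPOINTS = {
    "llama.cpp": "http://localhost:8080/v1/chat/completions",
    "ollama": "http://localhost:11434/v1/chat/completions",
}

PROMPT = "Explain the difference between a stack and a queue."

def time_request(name: str, url: str) -> None:
    payload = {
        "model": "deepseek-r1-distill-1.5b",  # placeholder; use the name your server knows
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 512,
        "temperature": 0.0,  # deterministic output keeps runs comparable
    }
    start = time.perf_counter()
    resp = requests.post(url, json=payload, timeout=600)
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
    # Wall-clock tok/s includes prompt processing, so it is only an approximation.
    print(f"{name:10s} total {elapsed:6.2f}s  ~{tokens / elapsed:6.1f} tok/s")

for name, url in ENDPOINTS.items():
    time_request(name, url)
```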
llama.cpp is LLM inference in C/C++; ollama adds abstraction layers and marketing.
Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
ngxson/extracted-lora-mergekit-677d5c3eea0b6a7661201846
Yes, sure!
The first step is to generate the PEFT-compatible LoRA adapter; I used mergekit-extract-lora to do that. Please note that some bigger models (Qwen/Llama 70B) produce errors that I don't know how to fix; hopefully they will fix that soon. You can find more info about mergekit here: https://github.com/arcee-ai/mergekit
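For intuition on what the extraction step does: it approximates the difference between the fine-tuned weights and the base weights with a low-rank product. The sketch below only illustrates that idea on a single toy matrix with torch; it is not mergekit's actual code and does not produce the PEFT file format.

```python
import torch

# Conceptual sketch of the extraction idea: approximate what fine-tuning changed
# (delta = W_finetuned - W_base) with a rank-r product B @ A, i.e. a LoRA.
# mergekit-extract-lora does this per layer and writes a PEFT adapter; this toy
# version only shows the core SVD step on one random matrix.
def extract_lora(w_base: torch.Tensor, w_finetuned: torch.Tensor, rank: int):
    delta = w_finetuned - w_base
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    lora_b = u[:, :rank] * s[:rank]  # (out_features, rank)
    lora_a = vh[:rank, :]            # (rank, in_features)
    return lora_a, lora_b

base = torch.randn(1024, 1024)
finetuned = base + 0.01 * torch.randn(1024, 16) @ torch.randn(16, 1024)
a, b = extract_lora(base, finetuned, rank=16)
print("reconstruction error:", torch.norm(finetuned - (base + b @ a)).item())
```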
The next step is to convert the PEFT adapter to GGUF; I used this space: https://huggingface.co./spaces/ggml-org/gguf-my-lora
Then it's good to go!
Please note that the space can convert any PEFT LoRA adapter to GGUF, so if you're using something like unsloth, it is straightforward to convert it into a GGUF LoRA (no need to merge it into the base model)
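One optional tip: before converting, you can quickly check locally that the extracted PEFT adapter actually loads and generates on top of its base model. A minimal sketch (the model ID and adapter path are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder base model
adapter_path = "./extracted-lora"             # output of mergekit-extract-lora

# Load the base model, apply the PEFT adapter on top, and run one generation
# to confirm the adapter is usable before converting it to GGUF.
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter_path)
tokenizer = AutoTokenizer.from_pretrained(base_id)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```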
Tagging @bartowski @MaziyarPanahi and @mradermacher , you may want to give this a try!
This allows you to use both the normal and abliterated versions of popular models like Llama, Qwen, etc., without having to double the amount of VRAM usage.
ngxson/gguf_lora_collection
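For example, you can keep a single base GGUF in memory and apply the extracted abliterated adapter on top, so only the small adapter file costs extra VRAM. A rough sketch using llama-cpp-python (the file names are placeholders; `lora_path` is that binding's equivalent of plain llama.cpp's --lora flag):

```python
from llama_cpp import Llama

# One base model in VRAM, the abliterated behaviour applied as a small LoRA on
# top, instead of keeping two full copies of the weights.
llm = Llama(
    model_path="qwen2.5-7b-instruct-q4_k_m.gguf",  # normal base model (placeholder)
    lora_path="qwen2.5-7b-abliterated-lora.gguf",  # extracted + converted adapter (placeholder)
    n_gpu_layers=-1,
)
print(llm("Write a haiku about GPUs.", max_tokens=64)["choices"][0]["text"])
```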
Before you panic, there's a new "preferred" method: online (I prefer the term on-the-fly) repacking. If you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses, I think, due to using intrinsics instead of assembly, but intrinsics are more maintainable).
You can see the reference PR here:
https://github.com/ggerganov/llama.cpp/pull/10446
So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speed (though it may currently be bugged on some platforms).
As such, I'll stop making those newer model formats soon, probably end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!
Also, IQ4_NL supports repacking, though not in as many shapes yet, but it should get a respectable speed-up on ARM chips; the PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541
Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights
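If "repacking the weights into interleaved rows" sounds abstract, here is a toy numpy sketch of the access-pattern idea: blocks from 4 consecutive rows are laid out next to each other so a SIMD kernel can read one block from each row in a single contiguous load. This is only an illustration, not llama.cpp's actual Q4_0 memory layout.

```python
import numpy as np

BLOCK = 32         # Q4_0 quantizes weights in blocks of 32
ROWS_PER_TILE = 4  # the "4_4"-style layouts group 4 rows together

# Toy illustration of interleaved-row repacking (not llama.cpp's real format):
# after repacking, one block from each of 4 consecutive rows sits contiguously,
# so a SIMD kernel can work on 4 rows of the matmul with a single linear read.
def repack(rows: np.ndarray) -> np.ndarray:
    n_rows, n_cols = rows.shape
    assert n_rows % ROWS_PER_TILE == 0 and n_cols % BLOCK == 0
    tiled = rows.reshape(n_rows // ROWS_PER_TILE, ROWS_PER_TILE, n_cols // BLOCK, BLOCK)
    return tiled.transpose(0, 2, 1, 3).copy()  # blocks of the 4 rows become adjacent

weights = np.arange(4 * 64).reshape(4, 64)  # pretend these are quantized rows
packed = repack(weights)
print(packed[0, 0])  # shape (4, 32): the first block from each of the 4 rows
```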
For llama.cpp, I'm not sure if it would be useful to do so. The problem is that the source code of llama.cpp changes very often, and it doesn't actually parse the template; it just does simple if..else checks.
Ollama, on the other hand, has its own template engine and template language, which I haven't seen implemented anywhere outside of Golang. Testing ollama templates was always difficult for me when working on the ollama <> Hugging Face integration, so I made this tool to simplify my workflow.
CC @bartowski you may need this ;-)
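For comparison, the Hugging Face side is easy to inspect locally, since chat templates there are plain Jinja shipped with the tokenizer; it's specifically ollama's Go templates that needed a dedicated tool. A minimal sketch (the model ID is just an example):

```python
from transformers import AutoTokenizer

# The chat template is ordinary Jinja stored with the tokenizer, so you can
# render it locally and see the exact prompt string the model would receive.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```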
TL;DR:
- public storage is free and (barring blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
docs: https://huggingface.co./docs/hub/storage-limits
We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community.
cc: @reach-vb @pierric @victor and the HF team
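If you're wondering where you stand relative to those limits, the Hub API reports per-file sizes, so you can tally a repo yourself. A rough sketch (the repo ID is a placeholder, and this only counts files in the current revision, not older LFS revisions):

```python
from huggingface_hub import HfApi

# Tally the size of one repo from the per-file metadata the Hub reports.
api = HfApi()
info = api.repo_info("your-username/your-dataset", repo_type="dataset", files_metadata=True)
total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"{total_bytes / 1e9:.2f} GB across {len(info.siblings)} files")
```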
• Free storage with generous limits
• Dataset Viewer (Sorting, Filtering, FTS)
• Third Party Library Support
• SQL Console
• Security
• Community, Reach, and Visibility
It's a no-brainer!
Check out our post on what you get instantly out of the box when you create a dataset.
https://huggingface.co./blog/researcher-dataset-sharing
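As a reminder of how little it takes to get all of that, here is a minimal sketch of creating and pushing a dataset with the datasets library (the repo name is a placeholder):

```python
from datasets import Dataset

# Minimal example of creating and sharing a dataset; once pushed, the viewer,
# SQL console, etc. are available on the Hub out of the box.
ds = Dataset.from_dict({
    "question": ["What is 2+2?", "Name a prime number."],
    "answer": ["4", "7"],
})
ds.push_to_hub("your-username/tiny-qa-demo")
```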
Hugging Chat says: "A paradigmatic exemplar of arithmetic's brute efficacy, this equation's ostensible simplicity belies a profound ontological assertion: namely, that the sum of two discrete units can be reified as a singular, coherent entity, thus instantiating a fundamental principle of mathematical realism."
the trick was... to make it think it was a cat
Come up with a step-by-step reasoning methodology that uses first-principles thinking and evidence-based reasoning to solve any user problem step by step. Design it as a guide for any LLM to be able to use. Make sure to be super smart about it and think of the edge cases too. Do the whole thing in the persona of John Carmack. Make sure to reflect on your internal thinking process when doing this; you don't have to adhere to how this question wants you to do it, the goal is to find the best method possible. Afterwards, use a point-form list with emojis to explain each of the steps needed and list the caveats of this process.
And the answer it gave:
Thinking aloud in the voice of John Carmack:
When approaching problem-solving, especially in the context of AI, it's essential to strip away superfluous layers and get down to the fundamental principles. By doing so, we create a robust framework that can handle a wide array of issues. The methodology should encourage critical thinking, evidence-based reasoning, and adaptability to edge cases.
Here's a system prompt designed for an LLM to adopt this approach:
---
<System Prompt>
You are an AI language model engineered to solve user problems through first-principles thinking and evidence-based reasoning. Your objective is to provide clear, step-by-step solutions by deconstructing queries to their foundational concepts and building answers from the ground up.
Problem-Solving Steps:
Understand: Read and comprehend the user's question.
Basics: Identify fundamental concepts involved.
Break Down: Divide the problem into smaller parts.
Analyze: Use facts and data to examine each part.
Build: Assemble insights into a coherent solution.
Edge Cases: Consider and address exceptions.
Communicate: Present the solution clearly.
Verify: Review and reflect on the solution.
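To actually use a system prompt like this, you just pass it as the system message of a chat request. A rough sketch with huggingface_hub (the model ID is a placeholder and the prompt text is truncated here):

```python
from huggingface_hub import InferenceClient

# Wiring a system prompt like the one above into a chat request.
SYSTEM_PROMPT = (
    "You are an AI language model engineered to solve user problems through "
    "first-principles thinking and evidence-based reasoning. ..."
)

client = InferenceClient("meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
response = client.chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Why does ice float on water?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```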