ngxson's activity
I have upgraded both, and using the same settings, I am running the same DeepSeek R1 Distill 1.5B model on the same hardware. It's an apples-to-apples comparison.
Total duration:
- llama.cpp: 6.85 sec <- 26.8% faster
- ollama: 8.69 sec

Breakdown by phase:

Model loading:
- llama.cpp: 241 ms <- 2x faster
- ollama: 553 ms

Prompt processing:
- llama.cpp: 416.04 tokens/s, eval time 45.67 ms <- 10x faster
- ollama: 42.17 tokens/s, eval time 498 ms

Token generation:
- llama.cpp: 137.79 tokens/s, eval time 6.62 sec <- 13% faster
- ollama: 122.07 tokens/s, eval time 7.64 sec
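If you want to reproduce this kind of comparison yourself, both servers expose an OpenAI-compatible /v1/chat/completions endpoint, so a small script can time the same request against each. This is only a rough sketch: the ports, the model name, and the wall-clock tokens/s calculation are assumptions about a local setup, not the exact methodology behind the numbers above.

```python
import time
import requests

# Hypothetical local endpoints: llama-server on its default port 8080 and
# ollama's OpenAI-compatible API on its default port 11434.
ENDPOINTS = {
    "llama.cpp": "http://localhost:8080/v1/chat/completions",
    "ollama": "http://localhost:11434/v1/chat/completions",
}

PROMPT = "Explain the difference between a stack and a queue."

def time_request(name: str, url: str) -> None:
    payload = {
        "model": "deepseek-r1-distill-1.5b",  # placeholder; use the name your server knows
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 512,
        "temperature": 0.0,  # deterministic output keeps runs comparable
    }
    start = time.perf_counter()
    resp = requests.post(url, json=payload, timeout=600)
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
    # Wall-clock tok/s includes prompt processing, so it is only an approximation.
    print(f"{name:10s} total {elapsed:6.2f}s  ~{tokens / elapsed:6.1f} tok/s")

for name, url in ENDPOINTS.items():
    time_request(name, url)
```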
llama.cpp is LLM inference in C/C++; ollama adds abstraction layers and marketing.
Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
ngxson/extracted-lora-mergekit-677d5c3eea0b6a7661201846
Yes, sure!
The first step is to generate the PEFT-compatible LoRA adapter; I used mergekit-extract-lora to do that. Please note that some bigger models (Qwen/Llama 70B) produce errors that I don't know how to fix; hopefully they will fix that soon. You can find more info about mergekit here: https://github.com/arcee-ai/mergekit
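For intuition on what the extraction step does: it approximates the difference between the fine-tuned weights and the base weights with a low-rank product. The sketch below only illustrates that idea on a single toy matrix with torch; it is not mergekit's actual code and does not produce the PEFT file format.

```python
import torch

# Conceptual sketch of the extraction idea: approximate what fine-tuning changed
# (delta = W_finetuned - W_base) with a rank-r product B @ A, i.e. a LoRA.
# mergekit-extract-lora does this per layer and writes a PEFT adapter; this toy
# version only shows the core SVD step on one random matrix.
def extract_lora(w_base: torch.Tensor, w_finetuned: torch.Tensor, rank: int):
    delta = w_finetuned - w_base
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    lora_b = u[:, :rank] * s[:rank]  # (out_features, rank)
    lora_a = vh[:rank, :]            # (rank, in_features)
    return lora_a, lora_b

base = torch.randn(1024, 1024)
finetuned = base + 0.01 * torch.randn(1024, 16) @ torch.randn(16, 1024)
a, b = extract_lora(base, finetuned, rank=16)
print("reconstruction error:", torch.norm(finetuned - (base + b @ a)).item())
```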
The next step is to convert the PEFT adapter to GGUF; I used this space: https://huggingface.co./spaces/ggml-org/gguf-my-lora
Then it's good to go!
Please note that the space can convert any PEFT LoRA adapter to GGUF, so if you're using something like unsloth, it is straightforward to convert it into a GGUF LoRA (no need to merge it into the base model)
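One optional tip: before converting, you can quickly check locally that the extracted PEFT adapter actually loads and generates on top of its base model. A minimal sketch (the model ID and adapter path are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder base model
adapter_path = "./extracted-lora"             # output of mergekit-extract-lora

# Load the base model, apply the PEFT adapter on top, and run one generation
# to confirm the adapter is usable before converting it to GGUF.
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
model = PeftModel.from_pretrained(model, adapter_path)
tokenizer = AutoTokenizer.from_pretrained(base_id)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```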
Tagging @bartowski @MaziyarPanahi and @mradermacher , you may want to give this a try!
This allows you to use both the normal and abliterated versions of popular models like Llama, Qwen, etc., without having to double the amount of VRAM usage.
ngxson/gguf_lora_collection
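For example, you can keep a single base GGUF in memory and apply the extracted abliterated adapter on top, so only the small adapter file costs extra VRAM. A rough sketch using llama-cpp-python (the file names are placeholders; `lora_path` is that binding's equivalent of plain llama.cpp's --lora flag):

```python
from llama_cpp import Llama

# One base model in VRAM, the abliterated behaviour applied as a small LoRA on
# top, instead of keeping two full copies of the weights.
llm = Llama(
    model_path="qwen2.5-7b-instruct-q4_k_m.gguf",  # normal base model (placeholder)
    lora_path="qwen2.5-7b-abliterated-lora.gguf",  # extracted + converted adapter (placeholder)
    n_gpu_layers=-1,
)
print(llm("Write a haiku about GPUs.", max_tokens=64)["choices"][0]["text"])
```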
Before you panic, there's a new "preferred" method: online (I prefer the term on-the-fly) repacking. If you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses, I think, due to using intrinsics instead of assembly, but intrinsics are more maintainable).
You can see the reference PR here:
https://github.com/ggerganov/llama.cpp/pull/10446
So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speed (though it may currently be bugged on some platforms).
As such, I'll stop making those newer model formats soon, probably end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!
Also, IQ4_NL supports repacking, though not in as many shapes yet, but it should get a respectable speed-up on ARM chips; the PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541
Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights
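If "repacking the weights into interleaved rows" sounds abstract, here is a toy numpy sketch of the access-pattern idea: blocks from 4 consecutive rows are laid out next to each other so a SIMD kernel can read one block from each row in a single contiguous load. This is only an illustration, not llama.cpp's actual Q4_0 memory layout.

```python
import numpy as np

BLOCK = 32         # Q4_0 quantizes weights in blocks of 32
ROWS_PER_TILE = 4  # the "4_4"-style layouts group 4 rows together

# Toy illustration of interleaved-row repacking (not llama.cpp's real format):
# after repacking, one block from each of 4 consecutive rows sits contiguously,
# so a SIMD kernel can work on 4 rows of the matmul with a single linear read.
def repack(rows: np.ndarray) -> np.ndarray:
    n_rows, n_cols = rows.shape
    assert n_rows % ROWS_PER_TILE == 0 and n_cols % BLOCK == 0
    tiled = rows.reshape(n_rows // ROWS_PER_TILE, ROWS_PER_TILE, n_cols // BLOCK, BLOCK)
    return tiled.transpose(0, 2, 1, 3).copy()  # blocks of the 4 rows become adjacent

weights = np.arange(4 * 64).reshape(4, 64)  # pretend these are quantized rows
packed = repack(weights)
print(packed[0, 0])  # shape (4, 32): the first block from each of the 4 rows
```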
For llama.cpp, I'm not sure if it would be useful to do so. The problem is that the source code of llama.cpp changes very often, and it doesn't actually parse the template; it just does simple if..else checks.
Ollama, on the other hand, has its own template engine and template language, which I haven't seen implemented anywhere outside of Golang. Testing ollama templates was always difficult for me when working on the ollama <> Hugging Face integration, so I made this tool to simplify my workflow.
CC @bartowski you may need this ;-)
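For comparison, the Hugging Face side is easy to inspect locally, since chat templates there are plain Jinja shipped with the tokenizer; it's specifically ollama's Go templates that needed a dedicated tool. A minimal sketch (the model ID is just an example):

```python
from transformers import AutoTokenizer

# The chat template is ordinary Jinja stored with the tokenizer, so you can
# render it locally and see the exact prompt string the model would receive.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```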
TL;DR:
- public storage is free and (barring blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
docs: https://huggingface.co./docs/hub/storage-limits
We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community.
cc: @reach-vb @pierric @victor and the HF team
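If you're wondering where you stand relative to those limits, the Hub API reports per-file sizes, so you can tally a repo yourself. A rough sketch (the repo ID is a placeholder, and this only counts files in the current revision, not older LFS revisions):

```python
from huggingface_hub import HfApi

# Tally the size of one repo from the per-file metadata the Hub reports.
api = HfApi()
info = api.repo_info("your-username/your-dataset", repo_type="dataset", files_metadata=True)
total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"{total_bytes / 1e9:.2f} GB across {len(info.siblings)} files")
```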
• Free storage with generous limits
• Dataset Viewer (Sorting, Filtering, FTS)
• Third Party Library Support
• SQL Console
• Security
• Community, Reach, and Visibility
It's a no-brainer!
Check out our post on what you get instantly out of the box when you create a dataset.
https://huggingface.co./blog/researcher-dataset-sharing
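As a reminder of how little it takes to get all of that, here is a minimal sketch of creating and pushing a dataset with the datasets library (the repo name is a placeholder):

```python
from datasets import Dataset

# Minimal example of creating and sharing a dataset; once pushed, the viewer,
# SQL console, etc. are available on the Hub out of the box.
ds = Dataset.from_dict({
    "question": ["What is 2+2?", "Name a prime number."],
    "answer": ["4", "7"],
})
ds.push_to_hub("your-username/tiny-qa-demo")
```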
Hugging Chat says: "A paradigmatic exemplar of arithmetic's brute efficacy, this equation's ostensible simplicity belies a profound ontological assertion: namely, that the sum of two discrete units can be reified as a singular, coherent entity, thus instantiating a fundamental principle of mathematical realism."
the trick was... to make it think it was a cat
Come up with a step-by-step reasoning methodology that uses first-principles thinking and evidence-based reasoning to solve any user problem step by step. Design it as a guide for any LLM to be able to use. Make sure to be super smart about it and think of the edge cases too. Do the whole thing in the persona of John Carmack. Make sure to reflect on your internal thinking process when doing this; you don't have to adhere to how this question wants you to do it, the goal is to find the best method possible. Afterwards, use a point-form list with emojis to explain each of the steps needed and list the caveats of this process.
And the answer it gave:
Thinking aloud in the voice of John Carmack:
When approaching problem-solving, especially in the context of AI, it's essential to strip away superfluous layers and get down to the fundamental principles. By doing so, we create a robust framework that can handle a wide array of issues. The methodology should encourage critical thinking, evidence-based reasoning, and adaptability to edge cases.
Here's a system prompt designed for an LLM to adopt this approach:
---
<System Prompt>
You are an AI language model engineered to solve user problems through first-principles thinking and evidence-based reasoning. Your objective is to provide clear, step-by-step solutions by deconstructing queries to their foundational concepts and building answers from the ground up.
Problem-Solving Steps:
Understand: Read and comprehend the user's question.
Basics: Identify fundamental concepts involved.
Break Down: Divide the problem into smaller parts.
Analyze: Use facts and data to examine each part.
Build: Assemble insights into a coherent solution.
Edge Cases: Consider and address exceptions.
Communicate: Present the solution clearly.
Verify: Review and reflect on the solution.
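To actually use a system prompt like this, you just pass it as the system message of a chat request. A rough sketch with huggingface_hub (the model ID is a placeholder and the prompt text is truncated here):

```python
from huggingface_hub import InferenceClient

# Wiring a system prompt like the one above into a chat request.
SYSTEM_PROMPT = (
    "You are an AI language model engineered to solve user problems through "
    "first-principles thinking and evidence-based reasoning. ..."
)

client = InferenceClient("meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
response = client.chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Why does ice float on water?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```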