Yi Cui

onekq

AI & ML interests

Benchmark, Code Generation Model

Recent Activity

Organizations

MLX Community · ONEKQ AI

onekq's activity

posted an update 2 days ago
QwQ-32B is amazing!

It ranks below o1-preview, but beats DeepSeek v3 and all Gemini models.
onekq-ai/WebApp1K-models-leaderboard

Now that we have such a powerful model that fits on a single GPU, can someone finetune a web app model to push the SOTA on my leaderboard? 🤗
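For context, a hedged sketch (my own assumption, not from the post) of one way a 32B model fits on a single GPU: 4-bit quantization via transformers + bitsandbytes. Memory needs still depend on context length and hardware, so treat this as a starting point, not a recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/QwQ-32B"

# 4-bit weights with bf16 compute keeps a 32B model within a single-GPU budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU
)
```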
posted an update 3 days ago
From my own experience, these are the pain points of reasoning model adoption.

(1) Expensive and, even worse, slow, due to excessive token output. You need to 10x your max output length to avoid clipping the thinking process.

(2) You have to filter out thinking tokens to retrieve the final output. For mature workflows, this means broad or deep refactoring (a sketch of both points follows below).

1p vendors (open-source and proprietary) ease these pain points by adapting their own models, but the problems are exposed when the reasoning model is hosted by 3p MaaS providers.
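A minimal sketch of both pain points, assuming the model wraps its reasoning in <think>...</think> tags (as QwQ/R1-style models do); the model name and request shape are hypothetical, not any vendor's official API.

```python
import re

# Pain point (2): strip <think>...</think> blocks so downstream code only
# sees the final answer. Assumes the model delimits its reasoning with tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(raw_completion: str) -> str:
    """Return the completion with reasoning tokens removed."""
    return THINK_BLOCK.sub("", raw_completion).strip()

# Pain point (1): budget far more output tokens than the visible answer needs,
# otherwise the thinking process gets clipped mid-stream.
request = {
    "model": "some-reasoning-model",  # hypothetical model name
    "max_tokens": 8192,               # ~10x a typical non-reasoning budget
    "messages": [{"role": "user", "content": "Write a login page test."}],
}

raw = "<think>The test needs to cover ...</think>Here is the test: ..."
print(strip_thinking(raw))  # -> "Here is the test: ..."
```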
posted an update 4 days ago
The bitter lesson (🏆Sutton🏆) should be the core value of all ML institutions and individuals.
posted an update 6 days ago
I was puzzled by the scope of 🐋DeepSeek🐋 projects, i.e. why they built (then open-sourced) so many pieces all over their technology stack. Good engineers are minimalists. They build only when they have to.

Then I realized that FP8 must be the main driving force here. Your raw inter-GPU bandwidth is cut in half (H800). But if you compress your data representation from 16 bits to 8 bits, the effective throughput of your workload stays unchanged!
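A back-of-the-envelope check of that claim, with illustrative numbers of my own (not measured H100/H800 specs):

```python
elements = 1_000_000_000              # activations to exchange between GPUs

bf16_bytes = elements * 2             # 16-bit representation
fp8_bytes = elements * 1              # 8-bit representation

full_bandwidth = 400e9                # hypothetical link speed, bytes/s
half_bandwidth = full_bandwidth / 2   # H800-style cut

t_bf16_full = bf16_bytes / full_bandwidth  # baseline: BF16 over the full link
t_fp8_half = fp8_bytes / half_bandwidth    # FP8 over the halved link

print(t_bf16_full == t_fp8_half)      # True: effective throughput unchanged
```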

The idea is simple but lots of work had to be done. Their v3 technical report will give you a holistic view (better than reading the code). To summarize, data structure is the foundation of any software. Since FP8 was new and untried, the ecosystem wasn't there, so DeepSeek became the trailblazer. Before cooking your meals, you need to till the land, grow crops, and grind the flour 😅
posted an update 7 days ago
H800 is all you need.

This is my summary of 🐋DeepSeek🐋 open source week. The H800 is as good as the H100, except the NVLink bandwidth is cut in half.

This is a crystal-clear challenge, and it rallied and motivated the innovations that followed. The rest are details.
posted an update 9 days ago
GPT 4.5 pulled off a pretty decent performance (on a par with Claude 3.7), but apparently there is no new SOTA. OAI has already stated that GPT 4.5 is not a frontier model.
onekq-ai/WebApp1K-models-leaderboard

No SOTA from the new models of either OAI or Anthropic. This is not a coincidence. You cannot make everyone happy when more and more workflows and applications rely on a single model.

Vertical models will inevitably rise.
posted an update 12 days ago
Necessity is the mother of invention. To understand ⚡FlashMLA⚡ by 🐋DeepSeek🐋, the first question to ask is why.

The keyword here is H800, a lower-end product tailored for export control. The point is to squeeze out as much performance as possible.

But here is the most important takeaway: this invention benefits EVERYONE.
replied to their post 12 days ago
posted an update 13 days ago
Huge disappointment with Claude Sonnet 3.7 😞 Big performance regression, worse than the June 2024 version. 👎
onekq-ai/WebApp1K-models-leaderboard

I'm sure this version improves on something, just not the thing my leaderboard measures. This proves the point that no model can be the best at everything.
posted an update 17 days ago
Still waiting for 👽Grok👽 3 API ⌛😞😫
replied to their post 21 days ago

Done. So if I understand correctly: you do not change the model weights, but rather tweak the inference logic? Somehow reminds me of speculative decoding.

replied to their post 24 days ago

Sure, this is what I intend to do.

But an HF 🤗 collection cannot include anything outside HF 🤗. It has to be a dataset, model, space, or paper. Do you have anything like those?

posted an update 24 days ago
R1 is still trending. Here is a collection of works trying to replicate R1.
onekq-ai/r1-reproduction-works-67a93f2fb8b21202c9eedf0b

Players include Hugging Face (Open R1), Stanford (simple scaling), Berkeley (Bespoke, Open Thoughts, etc.), ServiceNow, etc. I know there is another work from HKUST but couldn't find it on 🤗. Let me know if I missed any teams.
replied to their post about 1 month ago
replied to their post about 1 month ago

And their Python package too 😜

Having AI do the refactoring is a great idea though. It will be a breaking change if you switch your model from non-reasoning to reasoning.

posted an update about 1 month ago
o3-mini is slightly better than R1, but lags behind Claude. Sorry folks, no new SOTA 😕

But OAI definitely sets the fashion for APIs. temperature and top_p are history now; reasoning_effort will be copied by other vendors.

onekq-ai/WebApp1K-models-leaderboard
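For illustration, a hedged sketch of that API shift using the OpenAI Python SDK; the prompt and effort level are my own choices, so check the vendor docs for current parameter values.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The old sampling knobs (temperature, top_p) give way to reasoning_effort,
# which controls how much thinking the model does ("low" | "medium" | "high").
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[{"role": "user", "content": "Implement binary search in Python."}],
)
print(response.choices[0].message.content)
```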
posted an update about 1 month ago
Mistral Small 3 is SUPER fast, and has the highest score among 20B+ models, but it is still 11 points below Qwen 2.5 Coder 32B.

I believe specialty models are the future. The more you know what you want to do with a model, the more bang you get for your buck. If Mistral scopes this small model to coding only, I'm confident they can beat Qwen.

One day my leaderboard will be dominated by smol models that excel at one thing, not monolithic ones costing $$$. And I'm looking forward to that.

onekq-ai/WebApp1K-models-leaderboard
replied to their post about 1 month ago
posted an update about 1 month ago
So πŸ‹DeepSeekπŸ‹ hits the mainstream media. But it has been a star in our little cult for at least 6 months. Its meteoric success is not overnight, but two years in the making.

To learn their history, just look at their 🤗 repo: https://huggingface.co./deepseek-ai

* End of 2023, they launched their first model (pretrained by themselves) following the Llama 2 architecture
* June 2024, v2 (MoE architecture) surpassed Gemini 1.5, but behind Mistral
* September, v2.5 surpassed GPT 4o mini
* December, v3 surpassed GPT 4o
* Now R1 surpassed o1

Most importantly, if you think DeepSeek's success is singular and unrivaled, that's WRONG. The following models are also at or near the o1 bar.

* Minimax-01
* Kimi k1.5
* Doubao 1.5 pro
reacted to clem's post with 🔥 about 1 month ago