🐺🐦‍⬛ LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs
Introduction
I've been working tirelessly on my latest research project, comparing 25 state-of-the-art large language models by running them through the respected MMLU-Pro benchmark's computer science category. This involved:
- 59 separate benchmark runs
- Over 70 hours of total runtime
- Testing 25 different LLMs, including:
  - Latest models from Anthropic, Google, Alibaba, OpenAI, Mistral, Meta, and others
  - Multiple model sizes (parameters and quantization)
  - With and without speculative decoding (a technique that can speed up inference without compromising output quality)
The goal was to thoroughly and systematically evaluate these models to:
- Determine which performs best on computer science tasks as a proxy for general intelligence
- Compare open vs closed source models
- Analyze the impact of model size and quantization choices
- Measure the benefits of speculative decoding for inference speed
- Provide a detailed analysis of the results (and surprises!)
I started this project in November and have been continuously expanding the models tested while updating my findings.
The release of QwQ particularly caught my attention, as this unique model demonstrated exceptional performance in preliminary testing, warranting deeper analysis and more extensive evaluation.
While I could continue refining and expanding this research indefinitely, I've chosen to consolidate my key findings into a focused blog post that reflects the current state of my research. This analysis represents one of the most comprehensive evaluations of large language models to date, providing valuable insights for researchers and practitioners looking to assess these models for their specific needs or implement them in real-world applications.
About the Benchmark
The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a robust measure of general intelligence. While it is still a multiple-choice test, instead of the 4 answer options of its predecessor MMLU, there are now 10 options per question, which drastically reduces the probability of getting the answer right by chance. Additionally, the focus is increasingly on complex reasoning tasks rather than pure factual knowledge.
For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, computer science is the domain closest to my daily work, where I use these models most frequently, so I place particular emphasis on their responses in this context. Second, with local models running on consumer hardware, there are practical constraints around computation time - a single run already takes several hours with larger models, and I generally conduct at least two runs to ensure consistency.
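The question set itself is publicly available on Hugging Face. Here's a hedged sketch of how one might pull just the computer science questions with the `datasets` library - the split name and the exact category label are assumptions based on the published dataset card, so verify them before relying on this:

```python
# Hedged sketch: filtering MMLU-Pro down to the ~410 computer science questions.
# Assumes the TIGER-Lab/MMLU-Pro dataset exposes a "test" split with a "category"
# field labeled "computer science" -- check the dataset card if this differs.
from datasets import load_dataset

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
cs_questions = mmlu_pro.filter(lambda row: row["category"] == "computer science")
print(len(cs_questions))  # should report roughly 410 questions
```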
Unlike typical benchmarks that only report single scores, I conduct multiple test runs for each model to capture performance variability. This comprehensive approach delivers a more accurate and nuanced understanding of each model's true capabilities. By executing at least two benchmark runs per model, I establish a robust assessment of both performance levels and consistency. The results feature error bars that show standard deviation, illustrating how performance varies across different test runs.
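The statistics behind those error bars are simple: each bar is the mean of a model's runs, and the error bar is the standard deviation across them. A minimal sketch using Claude 3.5 Sonnet's two scores from the results table (whether the chart uses the sample or population estimator is an assumption here):

```python
# Mean and standard deviation over one model's per-run scores.
# Numbers are Claude 3.5 Sonnet's two runs from the results table below.
import statistics

runs = [82.93, 82.44]            # percentage scores from the individual runs
mean = statistics.mean(runs)      # height of the bar in the chart
stdev = statistics.stdev(runs)    # error bar length (sample standard deviation)
print(f"{mean:.2f}% ± {stdev:.2f}")
```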
The benchmarks for this study alone required over 70 hours of runtime. With additional categories or runs, testing with the available resources would have taken so long that the evaluated models would have been outdated by the time the study was completed. Establishing practical framework conditions and boundaries is therefore essential to achieve meaningful results within a reasonable timeframe.
Best Models
While what's best depends on the specific use case, these benchmarks offer a comprehensive overview of the current state-of-the-art in Large Language Models. Let's examine the graph at the top of this page highlighting the performance comparison among leading models:
The graph ranks models by their average score, with error bars indicating standard deviation. "Online" models are exclusively accessible through API providers such as Anthropic, Google, or OpenAI, while "Local" models can be downloaded directly from Hugging Face and run on your own hardware. The "Both" category indicates that these LLMs are available for both local deployment and through cloud API services like Azure, IONOS (a German provider especially relevant for GDPR-compliant applications requiring national cloud infrastructure), or Mistral.
Claude 3.5 Sonnet (20241022) stands out as the current top performer, which perfectly matches my hands-on experience. I've continuously used both versions of Claude 3.5 Sonnet (the original 20240620 and the updated 20241022) since their respective launches and consistently find it to be the most reliable and versatile solution across diverse applications. Based on its exceptional performance, I recommend it as the go-to model for most of my clients, provided online models are an option.
Gemini 1.5 Pro 002 demonstrates excellent performance, ranking second overall. While Google's latest experimental models reportedly achieve even better results, rate limits during benchmark testing prevented a proper evaluation of their capabilities.
QwQ 32B Preview is the best local model, surpassing many online models in performance. This is as amazing as it is surprising, as it's only a (relatively) small 32B model but outperforms all other local models in these benchmarks, including much larger 70B, 123B, or even 405B models. It even surpasses the online models from OpenAI (I could only test ChatGPT/GPT-4o) as well as the excellent Mistral models (which have always been among my personal favorites due to their outstanding multilingual capabilities).
The graph shows QwQ 32B Preview in various configurations with different settings and parameters. The 8.0bpw (8.0 bits per weight) version performs best (it's the largest available in EXL2 format), provided - and this is a major finding - the model is given enough room (max_tokens=16K) to "think"! This is QwQ's unique ability: It's capable of using chain of thought and self-reflection to arrive at the correct answer, without being specifically prompted to do so.
Consequently, QwQ performs worse in MMLU-Pro (and likely other benchmarks) if its output is truncated prematurely, which can easily happen with smaller output limits - MMLU-Pro's default is max_tokens=2K! This affects smaller quants more severely, as they aren't as intelligent as the 8.0bpw version and need to think longer (i.e., write more tokens) to arrive at the correct answer.
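As an illustration, here's a hedged sketch of what raising the token limit looks like for an OpenAI-compatible endpoint such as a local TabbyAPI server; the base URL, API key, and model name are placeholders, not the exact values used in these runs:

```python
# Hedged sketch: raising max_tokens for an OpenAI-compatible chat completion
# so QwQ's chain-of-thought output isn't truncated at the 2K default.
# Endpoint URL, API key, and model name are placeholders for a local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="QwQ-32B-Preview-exl2_8_0",
    messages=[{"role": "user", "content": "Question text with its 10 answer options ..."}],
    max_tokens=16384,   # give the model room to "think" before the final answer
    temperature=0.0,
)
print(response.choices[0].message.content)
```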
Athene V2 Chat is another excellent model, but it's not as stable as QwQ 32B Preview at 8-bit with max_tokens=16K. Its highest score slightly surpasses QwQ 32B Preview's, but QwQ is more consistent, with less variance, and therefore ranks higher in the graph based on average score. Athene is also a 72B model, so much larger than QwQ 32B Preview.
Qwen 2.5 72B Instruct, from the same Alibaba team behind QwQ, performs exceptionally well. Even quantized down to 4.65bpw to fit my 48 GB VRAM, it outperforms most other models in these benchmarks. The Qwen team is clearly leading in open-weights models, ahead of Meta and Mistral.
GPT-4o (2024-08-06) ranks lower than expected, and surprisingly, this older version performed better in the benchmark than both the latest ChatGPT version and its more recent iteration (2024-11-20).
Mistral Large 2407, a 123B model, follows GPT-4o. Like GPT-4o, this older version outperformed the latest version (2411) in the benchmark. This raises questions about whether newer models are trading intelligence for better writing or speed.
Llama 3.1 405B Instruct (FP8) is the next best local model. As the largest local model, its performance falls short of expectations, especially considering the resources it requires to run.
Mistral Large 2411, the newer version, slightly trails its older counterpart. While I appreciate Mistral's models for their excellent writing and multilingual capabilities, Qwen has taken the lead, especially considering Mistral Large's size and research-only license.
ChatGPT-4o (latest) is the API version of the current ChatGPT website model. Its benchmark was conducted on 2024-11-18, using the version available at that time.
Online models can be updated at any time, making versioned models a more reliable choice. Even with versioned models, providers may still modify parameters like quantization, settings, and safety guardrails without notice. For maximum consistency and full control, running models locally remains the only option!
GPT-4o (2024-11-20) is the latest version of GPT-4o. Again, it's curious that a newer version shows lower benchmark performance compared to its predecessor. Looks like they traded quality for speed.
Llama 3.1 70B Instruct is the next best local model. As a 70B model, it's relatively large but still runnable locally, especially when quantized. However, this benchmark used an online, unquantized version, representing its maximum performance.
Gemini 1.5 Flash 002, Google's compact model, delivers performance that reflects its smaller size - falling short of its Pro counterpart. Nevertheless, it impressively outperforms Meta's Llama 3.2 90B, demonstrating that smaller models can achieve remarkable results.
Llama 3.2 90B Vision Instruct represents Meta's multimodal evolution of Llama - essentially an enhanced Llama 3.1 70B with integrated vision capabilities. While its performance varies slightly, it maintains comparable effectiveness to the 70B version.
Qwen Coder 32B Instruct is another outstanding model in the Qwen family, specifically optimized for coding tasks. While it shares the same 32B parameter size as QwQ 32B Preview, it scores lower on this benchmark. This difference in performance is natural, as computer science knowledge and coding capabilities are distinct skill sets - specialized models often show reduced performance in broader domains outside their focus area.
Mistral Small 2409, a 22B parameter model, ranks last behind the QwQ and Mistral Large variants in these tests. I established a minimum threshold of 50% correct answers for inclusion in this benchmark set, making it the final model to qualify for analysis.
Detailed Results
Now that we've reviewed the rankings, let's explore the detailed results and uncover additional insights. Here's the complete table:
Model | HF Main Model Name | HF Draft Model Name (speculative decoding) | Size | Format | API | GPU | GPU Mem | Run | Duration | Total | % | TIGER-Lab | Correct Random Guesses | Prompt tokens | tk/s | Completion tokens | tk/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 1/2 | 31m 50s | 340/410 | 82.93% | ~= 82.44% | | 694458 | 362.78 | 97438 | 50.90
claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 2/2 | 31m 39s | 338/410 | 82.44% | == 82.44% | | 694458 | 364.82 | 97314 | 51.12
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 1/2 | 31m 7s | 335/410 | 81.71% | > 71.22% | | 648675 | 346.82 | 78311 | 41.87
gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 2/2 | 30m 40s | 327/410 | 79.76% | > 71.22% | | 648675 | 351.73 | 76063 | 41.24
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 1/2 | 2h 3m 30s | 325/410 | 79.27% | | 0/2, 0.00% | 656716 | 88.58 | 327825 | 44.22
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 2/2 | 2h 3m 35s | 324/410 | 79.02% | | | 656716 | 88.52 | 343440 | 46.29
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 1/2 | 2h 13m 5s | 326/410 | 79.51% | > 73.41% | | 656716 | 82.21 | 142256 | 17.81
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 2/2 | 2h 14m 53s | 317/410 | 77.32% | > 73.41% | | 656716 | 81.11 | 143659 | 17.74
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 1/2 | 3h 7m 58s | 320/410 | 78.05% | > 74.88% | | 656716 | 58.21 | 139499 | 12.36
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 2/2 | 3h 5m 19s | 319/410 | 77.80% | > 74.88% | | 656716 | 59.04 | 138135 | 12.42
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27636MiB | 1/2 | 1h 56m 8s | 319/410 | 77.80% | | 0/1, 0.00% | 656716 | 94.20 | 374973 | 53.79
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27636MiB | 2/2 | 1h 55m 44s | 318/410 | 77.56% | | | 656716 | 94.45 | 377638 | 54.31
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 1/2 | 34m 54s | 320/410 | 78.05% | ~= 78.29% | 1/2, 50.00% | 631448 | 300.79 | 99103 | 47.21 |
gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 2/2 | 42m 41s | 316/410 | 77.07% | ~< 78.29% | 1/3, 33.33% | 631448 | 246.02 | 98466 | 38.36 |
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38528MiB | 1/4 | 1h 29m 49s | 324/410 | 79.02% | | 0/1, 0.00% | 656716 | 121.70 | 229008 | 42.44
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38528MiB | 2/4 | 1h 32m 30s | 314/410 | 76.59% | | 0/2, 0.00% | 656716 | 118.24 | 239161 | 43.06
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 37000MiB | 3/4 | 2h 25m 24s | 308/410 | 75.12% | | 0/2, 0.00% | 656716 | 75.23 | 232208 | 26.60
QwQ-32B-Preview (8.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 37000MiB | 4/4 | 2h 27m 27s | 305/410 | 74.39% | | 0/3, 0.00% | 656716 | 74.19 | 235650 | 26.62
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) | ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 28556MiB | 1/2 | 2h 10m 53s | 310/410 | 75.61% | | | 656716 | 83.59 | 412512 | 52.51
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) | ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 28556MiB | 2/2 | 2h 25m 29s | 310/410 | 75.61% | | | 656716 | 75.20 | 478590 | 54.80
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 1/2 | 40m 23s | 310/410 | 75.61% | > 70.24% | | 696798 | 287.13 | 79444 | 32.74
mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 2/2 | 46m 55s | 308/410 | 75.12% | > 70.24% | 0/1, 0.00% | 696798 | 247.21 | 75971 | 26.95 |
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 1/2 | 2h 5m 28s | 311/410 | 75.85% | | | 648580 | 86.11 | 79191 | 10.51
Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 2/2 | 2h 10m 19s | 307/410 | 74.88% | | | 648580 | 82.90 | 79648 | 10.18
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 1/2 | 41m 46s | 302/410 | 73.66% | | 1/3, 33.33% | 696798 | 277.70 | 82028 | 32.69
mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 2/2 | 32m 47s | 300/410 | 73.17% | | 0/1, 0.00% | 696798 | 353.53 | 77998 | 39.57
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 26198MiB | 1/4 | 1h 39m 49s | 308/410 | 75.12% | | 0/1, 0.00% | 656716 | 109.59 | 243552 | 40.64
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27750MiB | 2/4 | 1h 22m 12s | 304/410 | 74.15% | | | 656716 | 133.04 | 247314 | 50.10
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 27750MiB | 3/4 | 1h 21m 39s | 296/410 | 72.20% | | | 656716 | 133.94 | 246020 | 50.18
QwQ-32B-Preview (4.25bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_4_25 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 26198MiB | 4/4 | 1h 42m 33s | 294/410 | 71.71% | | | 656716 | 106.63 | 250222 | 40.63
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 1/2 | 28m 17s | 302/410 | 73.66% | < 78.29% | 2/4, 50.00% | 631448 | 371.33 | 146558 | 86.18 |
chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 2/2 | 28m 31s | 298/410 | 72.68% | < 78.29% | 2/2, 100.00% | 631448 | 368.19 | 146782 | 85.59 |
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 1/2 | 25m 35s | 296/410 | 72.20% | | 1/7, 14.29% | 631448 | 410.38 | 158694 | 103.14
gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 2/2 | 26m 10s | 294/410 | 71.71% | | 1/7, 14.29% | 631448 | 400.95 | 160378 | 101.84
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 1/2 | 41m 12s | 291/410 | 70.98% | > 66.34% | 3/12, 25.00% | 648580 | 261.88 | 102559 | 41.41 |
Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 2/2 | 39m 48s | 287/410 | 70.00% | > 66.34% | 3/14, 21.43% | 648580 | 271.12 | 106644 | 44.58 |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 1/2 | 13m 19s | 288/410 | 70.24% | > 63.41% | 1/6, 16.67% | 648675 | 808.52 | 80535 | 100.38 |
gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 2/2 | 22m 30s | 285/410 | 69.51% | > 63.41% | 2/7, 28.57% | 648675 | 479.42 | 80221 | 59.29 |
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 1/2 | 33m 6s | 289/410 | 70.49% | | 4/7, 57.14% | 640380 | 321.96 | 88997 | 44.74
Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 2/2 | 31m 31s | 281/410 | 68.54% | | 2/5, 40.00% | 640380 | 338.10 | 85381 | 45.08
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 1/7 | 41m 59s | 289/410 | 70.49% | | | 656716 | 260.29 | 92126 | 36.51
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 2/7 | 34m 24s | 286/410 | 69.76% | | | 656716 | 317.48 | 89487 | 43.26
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 3/7 | 41m 27s | 283/410 | 69.02% | | 0/1, 0.00% | 656716 | 263.62 | 90349 | 36.27
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 4/7 | 42m 32s | 283/410 | 69.02% | | 0/1, 0.00% | 656716 | 256.77 | 90899 | 35.54
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 5/7 | 44m 34s | 282/410 | 68.78% | | 0/1, 0.00% | 656716 | 245.24 | 96470 | 36.03
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 38620MiB | 6/7 | 1h 2m 8s | 282/410 | 68.78% | | | 656716 | 175.98 | 92767 | 24.86
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 7/7 | 34m 56s | 280/410 | 68.29% | | | 656716 | 312.66 | 91926 | 43.76
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) | bartowski/QwQ-32B-Preview-exl2_3_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 22990MiB | 1/2 | 1h 15m 18s | 289/410 | 70.49% | | | 656716 | 145.23 | 269937 | 59.69
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) | bartowski/QwQ-32B-Preview-exl2_3_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 22990MiB | 2/2 | 1h 19m 50s | 274/410 | 66.83% | | 0/2, 0.00% | 656716 | 137.01 | 291818 | 60.88
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 1/2 | 1h 26m 26s | 284/410 | 69.27% | | 1/3, 33.33% | 696798 | 134.23 | 79925 | 15.40
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 2/2 | 1h 26m 10s | 275/410 | 67.07% | | 0/2, 0.00% | 696798 | 134.67 | 79778 | 15.42
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 8m 8s | 271/410 | 66.10% | < 70.24% | | 696798 | 170.29 | 66670 | 16.29
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 10m 38s | 268/410 | 65.37% | < 70.24% | 1/3, 33.33% | 696798 | 164.23 | 69182 | 16.31 |
QwQ-32B-Preview (3.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_3_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 21574MiB | 1/2 | 1h 5m 30s | 268/410 | 65.37% | | 1/3, 33.33% | 656716 | 166.95 | 205218 | 52.17
QwQ-32B-Preview (3.0bpw EXL2) | bartowski/QwQ-32B-Preview-exl2_3_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 21574MiB | 2/2 | 1h 8m 44s | 266/410 | 64.88% | | | 656716 | 159.10 | 215616 | 52.24
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 11m 50s | 267/410 | 65.12% | | 1/4, 25.00% | 696798 | 161.53 | 70538 | 16.35
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 13m 50s | 243/410 | 59.27% | | 0/4, 0.00% | 696798 | 157.18 | 72718 | 16.40
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 1/2 | 25m 3s | 243/410 | 59.27% | > 53.66% | 1/4, 25.00% | 696798 | 462.38 | 73212 | 48.58 |
mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 2/2 | 20m 45s | 239/410 | 58.29% | > 53.66% | 1/4, 25.00% | 696798 | 558.10 | 76017 | 60.89 |
- Model: Model name (with relevant parameter and setting details)
- HF Main Model Name: Full name of the tested model as listed on Hugging Face
- HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
- Size: Parameter count
- Format: Model format type (HF, EXL2, etc.)
- API: Service provider (TabbyAPI indicates local deployment)
- GPU: Graphics card used for this benchmark run
- GPU Mem: VRAM allocated to model and configuration
- Run: Benchmark run sequence number
- Duration: Total runtime of benchmark
- Total: Number of correct answers (determines ranking!)
- %: Percentage of correct answers
- TIGER-Lab: Comparison with the official MMLU-Pro CS results published by TIGER-Lab (the makers of MMLU-Pro)
- Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice, it falls back to a random guess and reports both the number of these random guesses and their accuracy - a high proportion of random guessing indicates problems with following the response format (see the sketch after this list)
- Prompt tokens: Token count of the input text
- tk/s: Prompt tokens processed per second
- Completion tokens: Token count of the generated response
- tk/s: Completion tokens generated per second
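To make the random-guess mechanism concrete, here's a hedged sketch of the kind of answer extraction such evaluation scripts perform; the official MMLU-Pro harness uses its own, more elaborate set of regular expressions, so treat this as illustrative only:

```python
# Hedged sketch of MMLU-Pro-style answer extraction: look for an explicit
# "answer is (X)" statement, otherwise fall back to a random guess and flag it.
import random
import re

OPTIONS = "ABCDEFGHIJ"  # up to 10 answer options per MMLU-Pro question

def extract_answer(response: str) -> tuple[str, bool]:
    """Return (chosen_letter, was_random_guess)."""
    match = re.search(r"answer is \(?([A-J])\)?", response, re.IGNORECASE)
    if match:
        return match.group(1).upper(), False
    # No parseable answer: the harness guesses randomly and records it as such.
    return random.choice(OPTIONS), True

print(extract_answer("... so the answer is (C)."))  # -> ('C', False)
print(extract_answer("It could be C, I guess."))    # -> random letter, True
```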
Speculative Decoding: Turbocharging Large Language Models
Speculative decoding represents a groundbreaking acceleration technique for LLMs that follows a "generate first, verify later" approach. This method employs a smaller, faster draft model to make preliminary token predictions, which are then validated by the main model.
The draft model races ahead and proposes several tokens, which the main model then verifies in a single batched forward pass - significantly reducing processing time compared to generating every token sequentially with the main model alone.
This innovative approach can accelerate text generation up to 3x while maintaining output quality. The effectiveness stems from batch processing, where the main model evaluates multiple predictions at once rather than processing individual tokens sequentially.
The system's performance heavily depends on prediction accuracy. If the acceptance rate of speculative tokens is too low, the system might actually perform slower than traditional processing. For optimal performance, the draft model should be architecturally similar to the main model to make accurate predictions, while being small enough to run quickly and fit in VRAM alongside the main model.
Think of speculative decoding as a turbocharger for AI - it significantly boosts LLM performance without compromising quality, creating a win-win situation for all LLM applications. The key is finding the right balance between draft model size and prediction accuracy - it needs to be lightweight enough to provide speed benefits while maintaining sufficient similarity to the main model for reliable predictions.
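For intuition, here's a minimal, self-contained sketch of the draft-and-verify loop with greedy acceptance. It uses toy stand-in "models" and is not any particular backend's implementation - real engines verify all drafted tokens in one batched forward pass rather than one call per token:

```python
# Minimal toy sketch of speculative decoding with greedy acceptance.
# The "models" are plain next-token functions; a real backend would verify the
# drafted tokens with a single batched forward pass of the main model.
from typing import Callable, List

def speculative_decode(
    main_model: Callable[[List[str]], str],   # slow, high-quality next-token function
    draft_model: Callable[[List[str]], str],  # fast, cheaper next-token function
    prompt: List[str],
    num_draft: int = 4,
    max_new_tokens: int = 16,
) -> List[str]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft phase: the small model proposes a few tokens cheaply.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_model(tokens + draft))
        # 2) Verify phase: keep the longest prefix the main model agrees with.
        for i, proposed in enumerate(draft):
            target = main_model(tokens + draft[:i])  # main model's own choice here
            if proposed != target:
                tokens.extend(draft[:i])   # accept the agreeing prefix ...
                tokens.append(target)      # ... plus the main model's correction
                break
        else:
            tokens.extend(draft)           # every proposal was accepted
    return tokens

# Toy demo: both "models" predict the next word of a fixed phrase, so every
# drafted token is accepted and generation advances num_draft words per step.
phrase = "the quick brown fox jumps over the lazy dog".split()
def next_word(context): return phrase[len(context) % len(phrase)]
print(" ".join(speculative_decode(next_word, next_word, prompt=[], max_new_tokens=8)))
```

The key point the sketch illustrates: whenever the draft model's proposal matches what the main model would have produced, those tokens come essentially for free; whenever it misses, the main model's token is used instead, so the final output is unchanged.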
Speculative Decoding & Qwen/QwQ
I successfully increased the speed of Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) from 24.86 to 43.76 tokens per second by using Qwen2.5-Coder-0.5B-Instruct as its draft model.
And applying the same 0.5B draft model to QwQ-32B-Preview (8.0bpw EXL2) - a different model - improved its speed from 26.60 to 46.29 tokens per second.
No, Speculative Decoding doesn't improve the quality of the output, only the speed
Initially, I was surprised to see QwQ-32B-Preview (8.0bpw EXL2) achieving only 75% accuracy without speculative decoding, while reaching 79.02% with it. Magic? Not quite - just statistical variance! Speculative decoding only improves processing speed, not output quality. This became clear when a second benchmark run with identical settings yielded just 76.59% accuracy.
The real game-changer was adjusting the benchmark software's max_tokens parameter. Setting it to 16384 consistently improved accuracy - first to 79.27%, then to 79.02% in a second run. These stable results make perfect sense: with the higher token limit, responses weren't truncated prematurely, allowing for more reliable answer identification.
While benchmarks can yield surprising results, always verify any seemingly impossible findings before jumping to conclusions. The explanation could be either an anomaly or a technical error.
Speed Demons
gpt-4o-2024-11-20 and gemini-1.5-flash-002 emerge as the clear speed leaders in this benchmark, both achieving impressive throughput of over 100 tokens per second. Notably, gemini-1.5-flash-002 was inconsistent across its two test runs - reaching peak speed in one but dropping to 59.29 tokens per second in the other.
The latest GPT-4o release (2024-11-20) reveals a fascinating trade-off: While it delivers dramatically improved speed exceeding 100 tokens per second - more than doubling its predecessor's throughput - the enhanced performance comes at a cost. The model appears to be a quantized or distilled variant, resulting in lower benchmark scores compared to the previous version.
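As a sanity check, throughput can be approximated straight from the results table as completion tokens divided by wall-clock duration (the reported tk/s values are measured by the benchmark harness, so this back-of-the-envelope figure only matches them approximately):

```python
# Rough throughput check (assumption: tk/s ≈ completion tokens / wall-clock time).
# Using chatgpt-4o-latest run 1/2 from the table: 146558 completion tokens in 28m 17s.
tokens = 146_558
seconds = 28 * 60 + 17
print(f"{tokens / seconds:.2f} tk/s")  # ≈ 86.36, in line with the reported 86.18
```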
Beyond the Benchmarks
Benchmark results show us what LLMs can do right now, but they're just one piece of the puzzle. Different real-world uses will get different results, since benchmarks can't capture every aspect of how these models actually perform.
Benchmarks are useful for comparing models, but they're just a starting point. If you find a model that looks good, try it out yourself to see how well it works for your needs.
I'm really excited about QwQ and what's coming next. It looks like these might be the first local models that can actually go toe-to-toe with the big cloud-based ones.
I've used QwQ alongside Claude 3.5 Sonnet and GPT-4 for real work projects, and QwQ did better than both in some situations. I'm really looking forward to seeing a QwQ 70B version - with a stronger foundation and further refinement, QwQ's unique approach could give us Sonnet-level performance right on our own machines. Sounds too good to be true? Maybe, but we're getting there - and probably, hopefully, faster than we think!
Closing Thoughts
This deep dive into MMLU-Pro benchmarking has revealed fascinating insights about the current state of LLMs. From the impressive performance of Claude 3.5 Sonnet and Gemini Pro to the speed demons like GPT-4o-2024-11-20, each model brings its own strengths to the table.
We've seen how architectural choices like speculative decoding can dramatically improve performance without sacrificing quality, and how careful parameter tuning (like adjusting max_tokens) can significantly impact results. The trade-offs between speed and accuracy, particularly evident in the latest GPT-4o release, highlight the complex balancing act in LLM development.
But most excitingly, we're witnessing the rise of powerful local models like QwQ that can compete with or even outperform much larger models. This democratization of AI capabilities is a promising trend for the future of accessible, high-performance language models.
As we continue to push the boundaries of what's possible with LLMs, these benchmarks serve as valuable waypoints in our journey toward more capable, efficient, and accessible AI systems. The rapid pace of innovation in this field ensures that what seems impressive today may be baseline tomorrow - and I, for one, can't wait to see what comes next.
Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.