πŸΊπŸ¦β€β¬› LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs

Community Article Published December 4, 2024

(Figure: benchmark results chart ranking all tested models by average MMLU-Pro CS score, with error bars showing the standard deviation across runs)

Introduction

I've been working tirelessly on my latest research project, comparing 25 state-of-the-art large language models by running them through the respected MMLU-Pro benchmark's computer science category. This involved:

  • 59 separate benchmark runs
  • Over 70 hours of total runtime
  • Testing 25 different LLMs, including:
    • Latest models from Anthropic, Google, Alibaba, OpenAI, Mistral, Meta, and others
    • Multiple model sizes (parameters and quantization)
    • With and without speculative decoding (a technique that can speed up inference without compromising output quality)

The goal was to thoroughly and systematically evaluate these models to:

  1. Determine which performs best on computer science tasks as a proxy for general intelligence
  2. Compare open vs closed source models
  3. Analyze the impact of model size and quantization choices
  4. Measure the benefits of speculative decoding for inference speed
  5. Provide a detailed analysis of the results (and surprises!)

I started this project in November and have been continuously expanding the models tested while updating my findings.

The release of QwQ particularly caught my attention, as this unique model demonstrated exceptional performance in preliminary testing, warranting deeper analysis and more extensive evaluation.

While I could continue refining and expanding this research indefinitely, I've chosen to consolidate my key findings into a focused blog post that reflects its current state. The result is one of the more comprehensive independent evaluations of current LLMs, providing valuable insights for researchers and practitioners looking to assess these models for their specific needs or implement them in real-world applications.

About the Benchmark

The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a robust measure of general intelligence. It remains a multiple-choice test, but where its predecessor MMLU offered 4 answer options per question, MMLU-Pro presents 10, which drastically reduces the probability of getting answers right by chance. Additionally, the focus shifts toward complex reasoning tasks rather than pure factual knowledge.

For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, computer science is the domain closest to my daily work, where I use these models most often, so results here are the most relevant to me. Second, with local models running on consumer hardware, there are practical constraints around computation time - a single run already takes several hours with larger models, and I generally conduct at least two runs to ensure consistency.

Unlike typical benchmark reports that list only a single score, I conduct multiple test runs for each model to capture performance variability. Running each benchmark at least twice per model yields a more accurate and nuanced picture of both performance level and consistency, which is why the results feature error bars showing the standard deviation across runs.

The benchmarks for this study alone required over 70 hours of runtime. With additional categories or runs, testing would have taken so long on the available hardware that the evaluated models would have been outdated by the time the study was completed. Setting practical constraints and boundaries is therefore essential to achieve meaningful results within a reasonable timeframe.

Best Models

While what's best depends on the specific use case, these benchmarks offer a comprehensive overview of the current state-of-the-art in Large Language Models. Let's examine the graph at the top of this page highlighting the performance comparison among leading models:

The graph ranks models by their average score, with error bars indicating standard deviation. "Online" models are exclusively accessible through API providers such as Anthropic, Google, or OpenAI, while "Local" models can be downloaded directly from Hugging Face and run on your own hardware. The "Both" category indicates that these LLMs are available for both local deployment and through cloud API services like Azure, IONOS (a German provider especially relevant for GDPR-compliant applications requiring national cloud infrastructure), or Mistral.
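To make the aggregation concrete, here's a minimal sketch (in Python, not the actual script behind this article) of how such a chart can be built: take the per-run percentages from the results table further below, compute mean and standard deviation per model, and plot a bar chart with error bars. Only a handful of models are included for illustration.

```python
# Minimal sketch of how the chart could be reproduced: per-model run scores
# (percent correct, taken from the results table below) are aggregated into
# mean and standard deviation, then plotted as a bar chart with error bars.
# The dictionary only lists a few models for illustration.
import numpy as np
import matplotlib.pyplot as plt

runs = {
    "Claude 3.5 Sonnet (20241022)": [82.93, 82.44],
    "Gemini 1.5 Pro 002":           [81.71, 79.76],
    "QwQ 32B Preview 8.0bpw (16K)": [79.27, 79.02],
    "Athene V2 Chat 4.65bpw":       [79.51, 77.32],
}

names = list(runs)
means = [np.mean(v) for v in runs.values()]
stds  = [np.std(v)  for v in runs.values()]   # spread across benchmark runs

order = np.argsort(means)                     # rank by average score
plt.barh([names[i] for i in order],
         [means[i] for i in order],
         xerr=[stds[i] for i in order], capsize=4)
plt.xlabel("MMLU-Pro CS score (% correct)")
plt.tight_layout()
plt.show()
```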

  1. Claude 3.5 Sonnet (20241022) stands out as the current top performer, which perfectly matches my hands-on experience. I've continuously used both versions of Claude 3.5 Sonnet (the original 20240620 and the updated 20241022) since their respective launches and consistently find it to be the most reliable and versatile solution across diverse applications. Based on its exceptional performance, I recommend it as the go-to model for most of my clients, provided online models are an option.

  2. Gemini 1.5 Pro 002 demonstrates excellent performance, ranking second overall. While Google's latest experimental models reportedly achieve even better results, rate limits during benchmark testing prevented a proper evaluation of their capabilities.

  3. QwQ 32B Preview is the best local model, surpassing many online models in performance. This is as amazing as it is surprising, as it's only a (relatively) small 32B model but outperforms all other local models in these benchmarks, including much larger 70B, 123B, or even 405B models. It even surpasses the online models from OpenAI (I could only test ChatGPT/GPT-4o) as well as the excellent Mistral models (which have always been among my personal favorites due to their outstanding multilingual capabilities).

    The graph shows QwQ 32B Preview in various configurations with different settings and parameters. The 8.0bpw (8.0 bits per weight) version performs best (it's the largest available in EXL2 format), provided - and this is a major finding - the model is given enough room (max_tokens=16K) to "think"! This is QwQ's unique ability: It's capable of using chain of thought and self-reflection to arrive at the correct answer, without being specifically prompted to do so.

    Consequently, QwQ performs worse in MMLU-Pro (and likely other benchmarks) if its output is truncated prematurely, which can easily happen with smaller output limits - MMLU-Pro's default is max_tokens=2K! This affects smaller quants more severely, as they aren't as intelligent as the 8.0bpw version and need to think longer (i.e., write more tokens) to arrive at the correct answer.

  4. Athene V2 Chat is another excellent model, but it's not as stable as QwQ 32B Preview at 8-bit with max_tokens=16K. Its best single run slightly surpasses QwQ 32B Preview's, but QwQ shows less variance across runs and therefore ranks higher by average score. Athene is also a 72B model, so considerably larger than the 32B QwQ.

  5. Qwen 2.5 72B Instruct, from the same Alibaba team behind QwQ, performs exceptionally well. Even quantized down to 4.65bpw to fit my 48 GB VRAM, it outperforms most other models in these benchmarks. The Qwen team is clearly leading in open-weights models, ahead of Meta and Mistral.

  6. GPT-4o (2024-08-06) appears lower than expected, and surprisingly, this older version performed better in the benchmark than the latest ChatGPT version or its more recent iteration (2024-11-20).

  7. Mistral Large 2407, a 123B model, follows GPT-4o. Like GPT-4o, this older version outperformed the latest version (2411) in the benchmark. This raises questions about whether newer models are trading intelligence for better writing or speed.

  8. Llama 3.1 405B Instruct (FP8) is the next best local model. As the largest local model, its performance falls short of expectations, especially considering the resources it requires to run.

  9. Mistral Large 2411, the newer version, slightly trails its older counterpart. While I appreciate Mistral's models for their excellent writing and multilingual capabilities, Qwen has taken the lead, especially considering Mistral Large's 123B size and its research-only license.

  10. ChatGPT-4o (latest) is the API version of the current ChatGPT website model. Its benchmark was conducted on 2024-11-18, using the version available at that time.

    Online models can be updated at any time, making versioned models a more reliable choice. Even with versioned models, providers may still modify parameters like quantization, settings, and safety guardrails without notice. For maximum consistency and full control, running models locally remains the only option!

  11. GPT-4o (2024-11-20) is the latest version of GPT-4o. Again, it's curious that a newer version shows lower benchmark performance compared to its predecessor. Looks like they traded quality for speed.

  12. Llama 3.1 70B Instruct is the next best local model. As a 70B model, it's relatively large but still runnable locally, especially when quantized. However, this benchmark used an online, unquantized version, representing its maximum performance.

  13. Gemini 1.5 Flash 002, Google's compact model, delivers performance that reflects its smaller size - falling short of its Pro counterpart. Nevertheless, it impressively outperforms Meta's Llama 3.2 90B, demonstrating that smaller models can achieve remarkable results.

  14. Llama 3.2 90B Vision Instruct represents Meta's multimodal evolution of Llama - essentially an enhanced Llama 3.1 70B with integrated vision capabilities. While its performance varies slightly, it maintains comparable effectiveness to the 70B version.

  15. Qwen Coder 32B Instruct is another outstanding model in the Qwen family, specifically optimized for coding tasks. While it shares the same 32B parameter size as QwQ 32B Preview, it scores lower on this benchmark. This difference in performance is natural, as computer science knowledge and coding capabilities are distinct skill sets - specialized models often show reduced performance in broader domains outside their focus area.

  16. Mistral Small 2409, a 22B parameter model, ranks last behind the QwQ and Mistral Large variants in these tests. I established a minimum threshold of 50% correct answers for inclusion in this benchmark set, making it the final model to qualify for analysis.

Detailed Results

Now that we've reviewed the rankings, let's explore the detailed results and uncover additional insights. Here's the complete table:

Model HF Main Model Name HF Draft Model Name (speculative decoding) Size Format API GPU GPU Mem Run Duration Total % TIGER-Lab Correct Random Guesses Prompt tokens tk/s Completion tokens tk/s
claude-3-5-sonnet-20241022 - - - - Anthropic - - 1/2 31m 50s 340/410 82.93% ~= 82.44% 694458 362.78 97438 50.90
claude-3-5-sonnet-20241022 - - - - Anthropic - - 2/2 31m 39s 338/410 82.44% == 82.44% 694458 364.82 97314 51.12
gemini-1.5-pro-002 - - - - Gemini - - 1/2 31m 7s 335/410 81.71% > 71.22% 648675 346.82 78311 41.87
gemini-1.5-pro-002 - - - - Gemini - - 2/2 30m 40s 327/410 79.76% > 71.22% 648675 351.73 76063 41.24
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) bartowski/QwQ-32B-Preview-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 38436MiB 1/2 2h 3m 30s 325/410 79.27% 0/2, 0.00% 656716 88.58 327825 44.22
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) bartowski/QwQ-32B-Preview-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 38436MiB 2/2 2h 3m 35s 324/410 79.02% 656716 88.52 343440 46.29
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 - 72B EXL2 TabbyAPI RTX 6000 44496MiB 1/2 2h 13m 5s 326/410 79.51% > 73.41% 656716 82.21 142256 17.81
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 - 72B EXL2 TabbyAPI RTX 6000 44496MiB 2/2 2h 14m 53s 317/410 77.32% > 73.41% 656716 81.11 143659 17.74
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 - 72B EXL2 TabbyAPI 2x RTX 3090 41150MiB 1/2 3h 7m 58s 320/410 78.05% > 74.88% 656716 58.21 139499 12.36
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 - 72B EXL2 TabbyAPI 2x RTX 3090 41150MiB 2/2 3h 5m 19s 319/410 77.80% > 74.88% 656716 59.04 138135 12.42
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) bartowski/QwQ-32B-Preview-exl2_4_25 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 27636MiB 1/2 1h 56m 8s 319/410 77.80% 0/1, 0.00% 656716 94.20 374973 53.79
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) bartowski/QwQ-32B-Preview-exl2_4_25 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 27636MiB 2/2 1h 55m 44s 318/410 77.56% 656716 94.45 377638 54.31
gpt-4o-2024-08-06 - - - - OpenAI - - 1/2 34m 54s 320/410 78.05% ~= 78.29% 1/2, 50.00% 631448 300.79 99103 47.21
gpt-4o-2024-08-06 - - - - OpenAI - - 2/2 42m 41s 316/410 77.07% ~< 78.29% 1/3, 33.33% 631448 246.02 98466 38.36
QwQ-32B-Preview (8.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 38528MiB 1/4 1h 29m 49s 324/410 79.02% 0/1, 0.00% 656716 121.70 229008 42.44
QwQ-32B-Preview (8.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 38528MiB 2/4 1h 32m 30s 314/410 76.59% 0/2, 0.00% 656716 118.24 239161 43.06
QwQ-32B-Preview (8.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_8_0 - 32B EXL2 TabbyAPI RTX 6000 37000MiB 3/4 2h 25m 24s 308/410 75.12% 0/2, 0.00% 656716 75.23 232208 26.60
QwQ-32B-Preview (8.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_8_0 - 32B EXL2 TabbyAPI RTX 6000 37000MiB 4/4 2h 27m 27s 305/410 74.39% 0/3, 0.00% 656716 74.19 235650 26.62
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 28556MiB 1/2 2h 10m 53s 310/410 75.61% 656716 83.59 412512 52.51
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 28556MiB 2/2 2h 25m 29s 310/410 75.61% 656716 75.20 478590 54.80
mistral-large-2407 (123B) mistralai/Mistral-Large-Instruct-2407 - 123B HF Mistral - - 1/2 40m 23s 310/410 75.61% > 70.24% 696798 287.13 79444 32.74
mistral-large-2407 (123B) mistralai/Mistral-Large-Instruct-2407 - 123B HF Mistral - - 2/2 46m 55s 308/410 75.12% > 70.24% 0/1, 0.00% 696798 247.21 75971 26.95
Llama-3.1-405B-Instruct-FP8 meta-llama/Llama-3.1-405B-Instruct-FP8 - 405B HF IONOS - - 1/2 2h 5m 28s 311/410 75.85% 648580 86.11 79191 10.51
Llama-3.1-405B-Instruct-FP8 meta-llama/Llama-3.1-405B-Instruct-FP8 - 405B HF IONOS - - 2/2 2h 10m 19s 307/410 74.88% 648580 82.90 79648 10.18
mistral-large-2411 (123B) mistralai/Mistral-Large-Instruct-2411 - 123B HF Mistral - - 1/2 41m 46s 302/410 73.66% 1/3, 33.33% 696798 277.70 82028 32.69
mistral-large-2411 (123B) mistralai/Mistral-Large-Instruct-2411 - 123B HF Mistral - - 2/2 32m 47s 300/410 73.17% 0/1, 0.00% 696798 353.53 77998 39.57
QwQ-32B-Preview (4.25bpw EXL2) bartowski/QwQ-32B-Preview-exl2_4_25 - 32B EXL2 TabbyAPI RTX 6000 26198MiB 1/4 1h 39m 49s 308/410 75.12% 0/1, 0.00% 656716 109.59 243552 40.64
QwQ-32B-Preview (4.25bpw EXL2) bartowski/QwQ-32B-Preview-exl2_4_25 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 27750MiB 2/4 1h 22m 12s 304/410 74.15% 656716 133.04 247314 50.10
QwQ-32B-Preview (4.25bpw EXL2) bartowski/QwQ-32B-Preview-exl2_4_25 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 27750MiB 3/4 1h 21m 39s 296/410 72.20% 656716 133.94 246020 50.18
QwQ-32B-Preview (4.25bpw EXL2) bartowski/QwQ-32B-Preview-exl2_4_25 - 32B EXL2 TabbyAPI RTX 6000 26198MiB 4/4 1h 42m 33s 294/410 71.71% 656716 106.63 250222 40.63
chatgpt-4o-latest @ 2024-11-18 - - - - OpenAI - - 1/2 28m 17s 302/410 73.66% < 78.29% 2/4, 50.00% 631448 371.33 146558 86.18
chatgpt-4o-latest @ 2024-11-18 - - - - OpenAI - - 2/2 28m 31s 298/410 72.68% < 78.29% 2/2, 100.00% 631448 368.19 146782 85.59
gpt-4o-2024-11-20 - - - - OpenAI - - 1/2 25m 35s 296/410 72.20% 1/7, 14.29% 631448 410.38 158694 103.14
gpt-4o-2024-11-20 - - - - OpenAI - - 2/2 26m 10s 294/410 71.71% 1/7, 14.29% 631448 400.95 160378 101.84
Llama-3.1-70B-Instruct meta-llama/Llama-3.1-70B-Instruct - 70B HF IONOS - - 1/2 41m 12s 291/410 70.98% > 66.34% 3/12, 25.00% 648580 261.88 102559 41.41
Llama-3.1-70B-Instruct meta-llama/Llama-3.1-70B-Instruct - 70B HF IONOS - - 2/2 39m 48s 287/410 70.00% > 66.34% 3/14, 21.43% 648580 271.12 106644 44.58
gemini-1.5-flash-002 - - - - Gemini - - 1/2 13m 19s 288/410 70.24% > 63.41% 1/6, 16.67% 648675 808.52 80535 100.38
gemini-1.5-flash-002 - - - - Gemini - - 2/2 22m 30s 285/410 69.51% > 63.41% 2/7, 28.57% 648675 479.42 80221 59.29
Llama-3.2-90B-Vision-Instruct meta-llama/Llama-3.2-90B-Vision-Instruct - 90B HF Azure - - 1/2 33m 6s 289/410 70.49% 4/7, 57.14% 640380 321.96 88997 44.74
Llama-3.2-90B-Vision-Instruct meta-llama/Llama-3.2-90B-Vision-Instruct - 90B HF Azure - - 2/2 31m 31s 281/410 68.54% 2/5, 40.00% 640380 338.10 85381 45.08
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 Qwen/Qwen2.5-Coder-3B-Instruct 32B EXL2 TabbyAPI RTX 6000 45880MiB 1/7 41m 59s 289/410 70.49% 656716 260.29 92126 36.51
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 40036MiB 2/7 34m 24s 286/410 69.76% 656716 317.48 89487 43.26
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 Qwen/Qwen2.5-Coder-3B-Instruct 32B EXL2 TabbyAPI RTX 6000 45880MiB 3/7 41m 27s 283/410 69.02% 0/1, 0.00% 656716 263.62 90349 36.27
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 32B EXL2 TabbyAPI RTX 6000 43688MiB 4/7 42m 32s 283/410 69.02% 0/1, 0.00% 656716 256.77 90899 35.54
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 32B EXL2 TabbyAPI RTX 6000 43688MiB 5/7 44m 34s 282/410 68.78% 0/1, 0.00% 656716 245.24 96470 36.03
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 - 32B EXL2 TabbyAPI RTX 6000 38620MiB 6/7 1h 2m 8s 282/410 68.78% 656716 175.98 92767 24.86
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 40036MiB 7/7 34m 56s 280/410 68.29% 656716 312.66 91926 43.76
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) bartowski/QwQ-32B-Preview-exl2_3_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 22990MiB 1/2 1h 15m 18s 289/410 70.49% 656716 145.23 269937 59.69
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) bartowski/QwQ-32B-Preview-exl2_3_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 22990MiB 2/2 1h 19m 50s 274/410 66.83% 0/2, 0.00% 656716 137.01 291818 60.88
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 - 123B EXL2 TabbyAPI RTX 6000 47068MiB 1/2 1h 26m 26s 284/410 69.27% 1/3, 33.33% 696798 134.23 79925 15.40
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 - 123B EXL2 TabbyAPI RTX 6000 47068MiB 2/2 1h 26m 10s 275/410 67.07% 0/2, 0.00% 696798 134.67 79778 15.42
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw - 123B EXL2 TabbyAPI RTX 6000 45096MiB 1/2 1h 8m 8s 271/410 66.10% < 70.24% 696798 170.29 66670 16.29
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw - 123B EXL2 TabbyAPI RTX 6000 45096MiB 2/2 1h 10m 38s 268/410 65.37% < 70.24% 1/3, 33.33% 696798 164.23 69182 16.31
QwQ-32B-Preview (3.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_3_0 - 32B EXL2 TabbyAPI RTX 6000 21574MiB 1/2 1h 5m 30s 268/410 65.37% 1/3, 33.33% 656716 166.95 205218 52.17
QwQ-32B-Preview (3.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_3_0 - 32B EXL2 TabbyAPI RTX 6000 21574MiB 2/2 1h 8m 44s 266/410 64.88% 656716 159.10 215616 52.24
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 - 123B EXL2 TabbyAPI RTX 6000 45096MiB 1/2 1h 11m 50s 267/410 65.12% 1/4, 25.00% 696798 161.53 70538 16.35
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 - 123B EXL2 TabbyAPI RTX 6000 45096MiB 2/2 1h 13m 50s 243/410 59.27% 0/4, 0.00% 696798 157.18 72718 16.40
mistral-small-2409 (22B) mistralai/Mistral-Small-Instruct-2409 - 22B HF Mistral - - 1/2 25m 3s 243/410 59.27% > 53.66% 1/4, 25.00% 696798 462.38 73212 48.58
mistral-small-2409 (22B) mistralai/Mistral-Small-Instruct-2409 - 22B HF Mistral - - 2/2 20m 45s 239/410 58.29% > 53.66% 1/4, 25.00% 696798 558.10 76017 60.89
  • Model: Model name (with relevant parameter and setting details)
  • HF Main Model Name: Full name of the tested model as listed on Hugging Face
  • HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
  • Size: Parameter count
  • Format: Model format type (HF, EXL2, etc.)
  • API: Service provider (TabbyAPI indicates local deployment)
  • GPU: Graphics card used for this benchmark run
  • GPU Mem: VRAM allocated to model and configuration
  • Run: Benchmark run sequence number
  • Duration: Total runtime of benchmark
  • Total: Number of correct answers (determines ranking!)
  • %: Percentage of correct answers
  • TIGER-Lab: Comparison between the CS benchmark results reported by TIGER-Lab (the makers of MMLU-Pro) and mine
  • Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice in its response, it falls back to a random guess and reports both the number of these random guesses and how many of them happened to be correct (a high proportion of random guessing indicates problems with following the expected response format; see the sketch after this list)
  • Prompt tokens: Token count of input text
  • tk/s: Tokens processed per second
  • Completion tokens: Token count of generated response
  • tk/s: Tokens generated per second
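To illustrate what the "Correct Random Guesses" column measures, here's a simplified sketch of the kind of answer handling MMLU-Pro performs. The regular expression and function below are illustrative assumptions, not the official evaluation code:

```python
# Simplified illustration of MMLU-Pro's answer handling (pattern is
# illustrative, not copied from the official script): try to extract the
# chosen option letter from the model's response; if that fails, fall back
# to a random guess and record that it was a guess.
import random
import re

OPTIONS = "ABCDEFGHIJ"  # MMLU-Pro uses 10 answer options per question

def extract_answer(response: str) -> tuple[str, bool]:
    """Return (chosen_option, was_random_guess)."""
    match = re.search(r"answer is \(?([A-J])\)?", response)
    if match:
        return match.group(1), False
    # No recognizable answer found: the harness guesses randomly.
    # A high share of such guesses means the model ignored the format.
    return random.choice(OPTIONS), True

# Example: a truncated chain-of-thought response with no final answer
choice, guessed = extract_answer("Let me think step by step about option (C)...")
print(choice, guessed)  # e.g. "F", True
```

A response that gets truncated before stating a final answer - exactly the failure mode discussed above for QwQ with small max_tokens - ends up in this random-guess bucket.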

Speculative Decoding: Turbocharging Large Language Models

Speculative decoding represents a groundbreaking acceleration technique for LLMs that follows a "generate first, verify later" approach. This method employs a smaller, faster draft model to make preliminary token predictions, which are then validated by the main model.

The draft model quickly proposes several tokens ahead, and the main model then verifies this entire batch in a single forward pass, which is significantly cheaper than generating the same tokens one at a time.

This approach can accelerate text generation by up to 3x while maintaining output quality: because the main model only accepts draft tokens that match what it would have produced itself, the final output is the same as unassisted generation - it just arrives faster.

The system's performance heavily depends on prediction accuracy. If the acceptance rate of speculative tokens is too low, the system might actually perform slower than traditional processing. For optimal performance, the draft model should be architecturally similar to the main model to make accurate predictions, while being small enough to run quickly and fit in VRAM alongside the main model.

Think of speculative decoding as a turbocharger for AI - it significantly boosts LLM performance without compromising quality, creating a win-win situation for all LLM applications. The key is finding the right balance between draft model size and prediction accuracy - it needs to be lightweight enough to provide speed benefits while maintaining sufficient similarity to the main model for reliable predictions.
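For readers who prefer code to analogies, here's a toy sketch of the propose/verify loop for the greedy case. The draft_next and main_next callables are placeholders standing in for real model inference, and a production engine would score all draft positions in one batched forward pass rather than in a Python loop:

```python
# Toy sketch of greedy speculative decoding (placeholder model functions,
# not a real inference API): the draft model proposes k tokens, the main
# model checks the proposal, and the longest matching prefix is accepted
# before the main model supplies the next token itself.
from typing import Callable, List

def speculative_step(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model: next-token pick
    main_next: Callable[[List[int]], int],    # main model: next-token pick
    k: int = 4,
) -> List[int]:
    # 1. Draft model speculates k tokens ahead (fast, sequential).
    draft = []
    ctx = list(prompt)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Main model verifies: in a real engine all k positions are scored
    #    in a single batched forward pass; here we loop for clarity.
    accepted = []
    ctx = list(prompt)
    for t in draft:
        if main_next(ctx) == t:      # main model agrees -> accept for free
            accepted.append(t)
            ctx.append(t)
        else:                        # first disagreement -> stop and let
            break                    # the main model emit its own token
    accepted.append(main_next(ctx))
    return accepted                  # 1 to k+1 tokens per main-model "step"
```

With greedy decoding, this procedure reproduces exactly the tokens the main model would have produced on its own; the speedup comes from getting up to k+1 tokens per verification pass whenever the draft model guesses right.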

Speculative Decoding & Qwen/QwQ

I successfully increased the speed of Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) from 24.86 to 43.76 tokens per second by using Qwen2.5-Coder-0.5B-Instruct as its draft model.

And applying the same 0.5B draft model to QwQ-32B-Preview (8.0bpw EXL2) - a different model - improved its speed from 26.60 to 46.29 tokens per second.
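In both cases, that works out to roughly a 1.75x throughput gain, as a quick check against the table's completion-token speeds confirms:

```python
# Throughput gains from the results table (completion tokens per second)
baseline_coder, draft_coder = 24.86, 43.76  # Qwen2.5-Coder-32B 8.0bpw
baseline_qwq,   draft_qwq   = 26.60, 46.29  # QwQ-32B-Preview 8.0bpw

print(f"Qwen2.5-Coder speedup: {draft_coder / baseline_coder:.2f}x")  # ~1.76x
print(f"QwQ-Preview speedup:   {draft_qwq / baseline_qwq:.2f}x")      # ~1.74x
```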

No, Speculative Decoding doesn't improve the quality of the output, only the speed

Initially, I was surprised to see QwQ-32B-Preview (8.0bpw EXL2) achieving only 75% accuracy without speculative decoding, while reaching 79.02% with it. Magic? Not quite - just statistical variance! Speculative decoding only improves processing speed, not output quality. This became clear when a second benchmark run with identical settings yielded just 76.59% accuracy.

The real game-changer was adjusting the benchmark software's max_tokens parameter. Setting it to 16384 consistently improved accuracy - first to 79.27%, then to 79.02% in a second run. These stable results make perfect sense: with the higher token limit, responses weren't truncated prematurely, allowing for more reliable answer identification.
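For anyone reproducing this, the relevant knob is the max_tokens value that the benchmark script passes along with each request. The snippet below is only a sketch of such a request against a local OpenAI-compatible endpoint (TabbyAPI in my setup); the URL, API key, and model name are placeholders to adapt to your own environment:

```python
# Hedged sketch: raising max_tokens so QwQ's chain-of-thought isn't cut off.
# Endpoint URL, API key, and model name are placeholders for a local
# TabbyAPI (OpenAI-compatible) server; adapt them to your own setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="QwQ-32B-Preview-exl2_8_0",
    messages=[{"role": "user", "content": "…one MMLU-Pro CS question…"}],
    max_tokens=16384,   # MMLU-Pro's default of 2K truncates QwQ's reasoning
    temperature=0.0,
)
print(response.choices[0].message.content)
```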

While benchmarks can yield surprising results, always verify any seemingly impossible findings before jumping to conclusions - the cause may simply be statistical variance or a technical error.

Speed Demons

gpt-4o-2024-11-20 and gemini-1.5-flash-002 emerge as the clear speed leaders in this benchmark, both achieving impressive throughput of over 100 tokens per second. Notably, gemini-1.5-flash-002 was inconsistent across its two test runs - reaching peak speed in one but dropping to 59.29 tokens per second in the other.

The latest GPT-4o release (2024-11-20) reveals a fascinating trade-off: While it delivers dramatically improved speed exceeding 100 tokens per second - more than doubling its predecessor's throughput - the enhanced performance comes at a cost. The model appears to be a quantized or distilled variant, resulting in lower benchmark scores compared to the previous version.

Beyond the Benchmarks

Benchmark results show us what LLMs can do right now, but they're just one piece of the puzzle. Different real-world uses will get different results, since benchmarks can't capture every aspect of how these models actually perform.

Benchmarks are useful for comparing models, but they're just a starting point. If you find a model that looks good, try it out yourself to see how well it works for your needs.

I'm really excited about QwQ and what's coming next. It looks like these might be the first local models that can actually go toe-to-toe with the big cloud-based ones.

I've used QwQ alongside Claude 3.5 Sonnet and GPT-4 for real work projects, and QwQ did better than both in some situations. I'm really looking forward to seeing a QwQ 70B version - with a stronger foundation and further refinement, QwQ's unique approach could give us Sonnet-level performance right on our own machines. Sounds too good to be true? Maybe, but we're getting there - and probably, hopefully, faster than we think!

Closing Thoughts

This deep dive into MMLU-Pro benchmarking has revealed fascinating insights about the current state of LLMs. From the impressive performance of Claude 3.5 Sonnet and Gemini Pro to the speed demons like GPT-4o-2024-11-20, each model brings its own strengths to the table.

We've seen how inference techniques like speculative decoding can dramatically improve speed without sacrificing quality, and how careful parameter tuning (like adjusting max_tokens) can significantly impact results. The trade-offs between speed and accuracy, particularly evident in the latest GPT-4o release, highlight the complex balancing act in LLM development.

But most excitingly, we're witnessing the rise of powerful local models like QwQ that can compete with or even outperform much larger models. This democratization of AI capabilities is a promising trend for the future of accessible, high-performance language models.

As we continue to push the boundaries of what's possible with LLMs, these benchmarks serve as valuable waypoints in our journey toward more capable, efficient, and accessible AI systems. The rapid pace of innovation in this field ensures that what seems impressive today may be baseline tomorrow - and I, for one, can't wait to see what comes next.

Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.