πŸΊπŸ¦β€β¬› LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs

Community Article Published December 4, 2024

(Figure: benchmark results chart ranking all tested models by average MMLU-Pro CS score, with error bars showing the standard deviation across runs)

Introduction

I've been working tirelessly on my latest research project, comparing 25 state-of-the-art large language models by running them through the respected MMLU-Pro benchmark's computer science category. This involved:

  • 59 separate benchmark runs
  • Over 70 hours of total runtime
  • Testing 25 different LLMs, including:
    • Latest models from Anthropic, Google, Alibaba, OpenAI, Mistral, Meta, and others
    • Multiple model sizes (parameters and quantization)
    • With and without speculative decoding (a technique that can speed up inference without compromising output quality)

The goal was to thoroughly and systematically evaluate these models to:

  1. Determine which performs best on computer science tasks as a proxy for general intelligence
  2. Compare open vs closed source models
  3. Analyze the impact of model size and quantization choices
  4. Measure the benefits of speculative decoding for inference speed
  5. Provide a detailed analysis of the results (and surprises!)

I started this project in November and have been continuously expanding the models tested while updating my findings.

The release of QwQ particularly caught my attention, as this unique model demonstrated exceptional performance in preliminary testing, warranting deeper analysis and more extensive evaluation.

While I could continue refining and expanding this research indefinitely, I've chosen to consolidate my key findings into a focused blog post that reflects its current state. The result is one of the more comprehensive independent evaluations of current LLMs, providing valuable insights for researchers and practitioners looking to assess these models for their specific needs or implement them in real-world applications.

About the Benchmark

The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a robust measure of general intelligence. It remains a multiple-choice test, but where its predecessor MMLU offered 4 answer options per question, MMLU-Pro presents 10, which drastically reduces the probability of getting answers right by chance. Additionally, the focus shifts toward complex reasoning tasks rather than pure factual knowledge.

For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, computer science is the domain closest to my daily work, where I use these models most often, so results here are the most relevant to me. Second, with local models running on consumer hardware, there are practical constraints around computation time - a single run already takes several hours with larger models, and I generally conduct at least two runs to ensure consistency.

Unlike typical benchmark reports that list only a single score, I conduct multiple test runs for each model to capture performance variability. Running each benchmark at least twice per model yields a more accurate and nuanced picture of both performance level and consistency, which is why the results feature error bars showing the standard deviation across runs.

The benchmarks for this study alone required over 70 hours of runtime. With additional categories or runs, testing would have taken so long on the available hardware that the evaluated models would have been outdated by the time the study was completed. Setting practical constraints and boundaries is therefore essential to achieve meaningful results within a reasonable timeframe.

Best Models

While what's best depends on the specific use case, these benchmarks offer a comprehensive overview of the current state-of-the-art in Large Language Models. Let's examine the graph at the top of this page highlighting the performance comparison among leading models:

The graph ranks models by their average score, with error bars indicating standard deviation. "Online" models are exclusively accessible through API providers such as Anthropic, Google, or OpenAI, while "Local" models can be downloaded directly from Hugging Face and run on your own hardware. The "Both" category indicates that these LLMs are available for both local deployment and through cloud API services like Azure, IONOS (a German provider especially relevant for GDPR-compliant applications requiring national cloud infrastructure), or Mistral.
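To make the aggregation concrete, here's a minimal sketch (in Python, not the actual script behind this article) of how such a chart can be built: take the per-run percentages from the results table further below, compute mean and standard deviation per model, and plot a bar chart with error bars. Only a handful of models are included for illustration.

```python
# Minimal sketch of how the chart could be reproduced: per-model run scores
# (percent correct, taken from the results table below) are aggregated into
# mean and standard deviation, then plotted as a bar chart with error bars.
# The dictionary only lists a few models for illustration.
import numpy as np
import matplotlib.pyplot as plt

runs = {
    "Claude 3.5 Sonnet (20241022)": [82.93, 82.44],
    "Gemini 1.5 Pro 002":           [81.71, 79.76],
    "QwQ 32B Preview 8.0bpw (16K)": [79.27, 79.02],
    "Athene V2 Chat 4.65bpw":       [79.51, 77.32],
}

names = list(runs)
means = [np.mean(v) for v in runs.values()]
stds  = [np.std(v)  for v in runs.values()]   # spread across benchmark runs

order = np.argsort(means)                     # rank by average score
plt.barh([names[i] for i in order],
         [means[i] for i in order],
         xerr=[stds[i] for i in order], capsize=4)
plt.xlabel("MMLU-Pro CS score (% correct)")
plt.tight_layout()
plt.show()
```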

  1. Claude 3.5 Sonnet (20241022) stands out as the current top performer, which perfectly matches my hands-on experience. I've continuously used both versions of Claude 3.5 Sonnet (the original 20240620 and the updated 20241022) since their respective launches and consistently find it to be the most reliable and versatile solution across diverse applications. Based on its exceptional performance, I recommend it as the go-to model for most of my clients, provided online models are an option.

  2. Gemini 1.5 Pro 002 demonstrates excellent performance, ranking second overall. While Google's latest experimental models reportedly achieve even better results, rate limits during benchmark testing prevented a proper evaluation of their capabilities.

  3. QwQ 32B Preview is the best local model, surpassing many online models in performance. This is as amazing as it is surprising, as it's only a (relatively) small 32B model but outperforms all other local models in these benchmarks, including much larger 70B, 123B, or even 405B models. It even surpasses the online models from OpenAI (I could only test ChatGPT/GPT-4o) as well as the excellent Mistral models (which have always been among my personal favorites due to their outstanding multilingual capabilities).

    The graph shows QwQ 32B Preview in various configurations with different settings and parameters. The 8.0bpw (8.0 bits per weight) version performs best (it's the largest available in EXL2 format), provided - and this is a major finding - the model is given enough room (max_tokens=16K) to "think"! This is QwQ's unique ability: It's capable of using chain of thought and self-reflection to arrive at the correct answer, without being specifically prompted to do so.

    Consequently, QwQ performs worse in MMLU-Pro (and likely other benchmarks) if its output is truncated prematurely, which can easily happen with smaller output limits - MMLU-Pro's default is max_tokens=2K! This affects smaller quants more severely, as they aren't as intelligent as the 8.0bpw version and need to think longer (i.e., write more tokens) to arrive at the correct answer.

  4. Athene V2 Chat is another excellent model, but it's not as stable as QwQ 32B Preview at 8-bit with max_tokens=16K. Its best single run slightly surpasses QwQ 32B Preview's, but QwQ shows less variance across runs and therefore ranks higher by average score. Athene is also a 72B model, so considerably larger than the 32B QwQ.

  5. Qwen 2.5 72B Instruct, from the same Alibaba team behind QwQ, performs exceptionally well. Even quantized down to 4.65bpw to fit my 48 GB VRAM, it outperforms most other models in these benchmarks. The Qwen team is clearly leading in open-weights models, ahead of Meta and Mistral.

  6. GPT-4o (2024-08-06) appears lower than expected, and surprisingly, this older version performed better in the benchmark than the latest ChatGPT version or its more recent iteration (2024-11-20).

  7. Mistral Large 2407, a 123B model, follows GPT-4o. Like GPT-4o, this older version outperformed the latest version (2411) in the benchmark. This raises questions about whether newer models are trading intelligence for better writing or speed.

  8. Llama 3.1 405B Instruct (FP8) is the next best local model. As the largest local model, its performance falls short of expectations, especially considering the resources it requires to run.

  9. Mistral Large 2411, the newer version, slightly trails its older counterpart. While I appreciate Mistral's models for their excellent writing and multilingual capabilities, Qwen has taken the lead, especially considering Mistral Large's 123B size and its research-only license.

  10. ChatGPT-4o (latest) is the API version of the current ChatGPT website model. Its benchmark was conducted on 2024-11-18, using the version available at that time.

    Online models can be updated at any time, making versioned models a more reliable choice. Even with versioned models, providers may still modify parameters like quantization, settings, and safety guardrails without notice. For maximum consistency and full control, running models locally remains the only option!

  11. GPT-4o (2024-11-20) is the latest version of GPT-4o. Again, it's curious that a newer version shows lower benchmark performance compared to its predecessor. Looks like they traded quality for speed.

  12. Llama 3.1 70B Instruct is the next best local model. As a 70B model, it's relatively large but still runnable locally, especially when quantized. However, this benchmark used an online, unquantized version, representing its maximum performance.

  13. Gemini 1.5 Flash 002, Google's compact model, delivers performance that reflects its smaller size - falling short of its Pro counterpart. Nevertheless, it impressively outperforms Meta's Llama 3.2 90B, demonstrating that smaller models can achieve remarkable results.

  14. Llama 3.2 90B Vision Instruct represents Meta's multimodal evolution of Llama - essentially an enhanced Llama 3.1 70B with integrated vision capabilities. While its performance varies slightly, it maintains comparable effectiveness to the 70B version.

  15. Qwen Coder 32B Instruct is another outstanding model in the Qwen family, specifically optimized for coding tasks. While it shares the same 32B parameter size as QwQ 32B Preview, it scores lower on this benchmark. This difference in performance is natural, as computer science knowledge and coding capabilities are distinct skill sets - specialized models often show reduced performance in broader domains outside their focus area.

  16. Mistral Small 2409, a 22B parameter model, ranks last behind the QwQ and Mistral Large variants in these tests. I established a minimum threshold of 50% correct answers for inclusion in this benchmark set, making it the final model to qualify for analysis.

Detailed Results

Now that we've reviewed the rankings, let's explore the detailed results and uncover additional insights. Here's the complete table:

Model HF Main Model Name HF Draft Model Name (speculative decoding) Size Format API GPU GPU Mem Run Duration Total % TIGER-Lab Correct Random Guesses Prompt tokens tk/s Completion tokens tk/s
claude-3-5-sonnet-20241022 - - - - Anthropic - - 1/2 31m 50s 340/410 82.93% ~= 82.44% 694458 362.78 97438 50.90
claude-3-5-sonnet-20241022 - - - - Anthropic - - 2/2 31m 39s 338/410 82.44% == 82.44% 694458 364.82 97314 51.12
gemini-1.5-pro-002 - - - - Gemini - - 1/2 31m 7s 335/410 81.71% > 71.22% 648675 346.82 78311 41.87
gemini-1.5-pro-002 - - - - Gemini - - 2/2 30m 40s 327/410 79.76% > 71.22% 648675 351.73 76063 41.24
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) bartowski/QwQ-32B-Preview-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 38436MiB 1/2 2h 3m 30s 325/410 79.27% 0/2, 0.00% 656716 88.58 327825 44.22
QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) bartowski/QwQ-32B-Preview-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 38436MiB 2/2 2h 3m 35s 324/410 79.02% 656716 88.52 343440 46.29
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 - 72B EXL2 TabbyAPI RTX 6000 44496MiB 1/2 2h 13m 5s 326/410 79.51% > 73.41% 656716 82.21 142256 17.81
Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 - 72B EXL2 TabbyAPI RTX 6000 44496MiB 2/2 2h 14m 53s 317/410 77.32% > 73.41% 656716 81.11 143659 17.74
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 - 72B EXL2 TabbyAPI 2x RTX 3090 41150MiB 1/2 3h 7m 58s 320/410 78.05% > 74.88% 656716 58.21 139499 12.36
Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 - 72B EXL2 TabbyAPI 2x RTX 3090 41150MiB 2/2 3h 5m 19s 319/410 77.80% > 74.88% 656716 59.04 138135 12.42
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) bartowski/QwQ-32B-Preview-exl2_4_25 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 27636MiB 1/2 1h 56m 8s 319/410 77.80% 0/1, 0.00% 656716 94.20 374973 53.79
QwQ-32B-Preview (4.25bpw EXL2, max_tokens=16384) bartowski/QwQ-32B-Preview-exl2_4_25 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 27636MiB 2/2 1h 55m 44s 318/410 77.56% 656716 94.45 377638 54.31
gpt-4o-2024-08-06 - - - - OpenAI - - 1/2 34m 54s 320/410 78.05% ~= 78.29% 1/2, 50.00% 631448 300.79 99103 47.21
gpt-4o-2024-08-06 - - - - OpenAI - - 2/2 42m 41s 316/410 77.07% ~< 78.29% 1/3, 33.33% 631448 246.02 98466 38.36
QwQ-32B-Preview (8.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 38528MiB 1/4 1h 29m 49s 324/410 79.02% 0/1, 0.00% 656716 121.70 229008 42.44
QwQ-32B-Preview (8.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 38528MiB 2/4 1h 32m 30s 314/410 76.59% 0/2, 0.00% 656716 118.24 239161 43.06
QwQ-32B-Preview (8.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_8_0 - 32B EXL2 TabbyAPI RTX 6000 37000MiB 3/4 2h 25m 24s 308/410 75.12% 0/2, 0.00% 656716 75.23 232208 26.60
QwQ-32B-Preview (8.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_8_0 - 32B EXL2 TabbyAPI RTX 6000 37000MiB 4/4 2h 27m 27s 305/410 74.39% 0/3, 0.00% 656716 74.19 235650 26.62
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 28556MiB 1/2 2h 10m 53s 310/410 75.61% 656716 83.59 412512 52.51
QwQ-32B-Preview-abliterated (4.5bpw EXL2, max_tokens=16384) ibrahimkettaneh_QwQ-32B-Preview-abliterated-4.5bpw-h8-exl2 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 28556MiB 2/2 2h 25m 29s 310/410 75.61% 656716 75.20 478590 54.80
mistral-large-2407 (123B) mistralai/Mistral-Large-Instruct-2407 - 123B HF Mistral - - 1/2 40m 23s 310/410 75.61% > 70.24% 696798 287.13 79444 32.74
mistral-large-2407 (123B) mistralai/Mistral-Large-Instruct-2407 - 123B HF Mistral - - 2/2 46m 55s 308/410 75.12% > 70.24% 0/1, 0.00% 696798 247.21 75971 26.95
Llama-3.1-405B-Instruct-FP8 meta-llama/Llama-3.1-405B-Instruct-FP8 - 405B HF IONOS - - 1/2 2h 5m 28s 311/410 75.85% 648580 86.11 79191 10.51
Llama-3.1-405B-Instruct-FP8 meta-llama/Llama-3.1-405B-Instruct-FP8 - 405B HF IONOS - - 2/2 2h 10m 19s 307/410 74.88% 648580 82.90 79648 10.18
mistral-large-2411 (123B) mistralai/Mistral-Large-Instruct-2411 - 123B HF Mistral - - 1/2 41m 46s 302/410 73.66% 1/3, 33.33% 696798 277.70 82028 32.69
mistral-large-2411 (123B) mistralai/Mistral-Large-Instruct-2411 - 123B HF Mistral - - 2/2 32m 47s 300/410 73.17% 0/1, 0.00% 696798 353.53 77998 39.57
QwQ-32B-Preview (4.25bpw EXL2) bartowski/QwQ-32B-Preview-exl2_4_25 - 32B EXL2 TabbyAPI RTX 6000 26198MiB 1/4 1h 39m 49s 308/410 75.12% 0/1, 0.00% 656716 109.59 243552 40.64
QwQ-32B-Preview (4.25bpw EXL2) bartowski/QwQ-32B-Preview-exl2_4_25 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 27750MiB 2/4 1h 22m 12s 304/410 74.15% 656716 133.04 247314 50.10
QwQ-32B-Preview (4.25bpw EXL2) bartowski/QwQ-32B-Preview-exl2_4_25 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 27750MiB 3/4 1h 21m 39s 296/410 72.20% 656716 133.94 246020 50.18
QwQ-32B-Preview (4.25bpw EXL2) bartowski/QwQ-32B-Preview-exl2_4_25 - 32B EXL2 TabbyAPI RTX 6000 26198MiB 4/4 1h 42m 33s 294/410 71.71% 656716 106.63 250222 40.63
chatgpt-4o-latest @ 2024-11-18 - - - - OpenAI - - 1/2 28m 17s 302/410 73.66% < 78.29% 2/4, 50.00% 631448 371.33 146558 86.18
chatgpt-4o-latest @ 2024-11-18 - - - - OpenAI - - 2/2 28m 31s 298/410 72.68% < 78.29% 2/2, 100.00% 631448 368.19 146782 85.59
gpt-4o-2024-11-20 - - - - OpenAI - - 1/2 25m 35s 296/410 72.20% 1/7, 14.29% 631448 410.38 158694 103.14
gpt-4o-2024-11-20 - - - - OpenAI - - 2/2 26m 10s 294/410 71.71% 1/7, 14.29% 631448 400.95 160378 101.84
Llama-3.1-70B-Instruct meta-llama/Llama-3.1-70B-Instruct - 70B HF IONOS - - 1/2 41m 12s 291/410 70.98% > 66.34% 3/12, 25.00% 648580 261.88 102559 41.41
Llama-3.1-70B-Instruct meta-llama/Llama-3.1-70B-Instruct - 70B HF IONOS - - 2/2 39m 48s 287/410 70.00% > 66.34% 3/14, 21.43% 648580 271.12 106644 44.58
gemini-1.5-flash-002 - - - - Gemini - - 1/2 13m 19s 288/410 70.24% > 63.41% 1/6, 16.67% 648675 808.52 80535 100.38
gemini-1.5-flash-002 - - - - Gemini - - 2/2 22m 30s 285/410 69.51% > 63.41% 2/7, 28.57% 648675 479.42 80221 59.29
Llama-3.2-90B-Vision-Instruct meta-llama/Llama-3.2-90B-Vision-Instruct - 90B HF Azure - - 1/2 33m 6s 289/410 70.49% 4/7, 57.14% 640380 321.96 88997 44.74
Llama-3.2-90B-Vision-Instruct meta-llama/Llama-3.2-90B-Vision-Instruct - 90B HF Azure - - 2/2 31m 31s 281/410 68.54% 2/5, 40.00% 640380 338.10 85381 45.08
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 Qwen/Qwen2.5-Coder-3B-Instruct 32B EXL2 TabbyAPI RTX 6000 45880MiB 1/7 41m 59s 289/410 70.49% 656716 260.29 92126 36.51
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 40036MiB 2/7 34m 24s 286/410 69.76% 656716 317.48 89487 43.26
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 Qwen/Qwen2.5-Coder-3B-Instruct 32B EXL2 TabbyAPI RTX 6000 45880MiB 3/7 41m 27s 283/410 69.02% 0/1, 0.00% 656716 263.62 90349 36.27
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 32B EXL2 TabbyAPI RTX 6000 43688MiB 4/7 42m 32s 283/410 69.02% 0/1, 0.00% 656716 256.77 90899 35.54
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 32B EXL2 TabbyAPI RTX 6000 43688MiB 5/7 44m 34s 282/410 68.78% 0/1, 0.00% 656716 245.24 96470 36.03
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 - 32B EXL2 TabbyAPI RTX 6000 38620MiB 6/7 1h 2m 8s 282/410 68.78% 656716 175.98 92767 24.86
Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 40036MiB 7/7 34m 56s 280/410 68.29% 656716 312.66 91926 43.76
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) bartowski/QwQ-32B-Preview-exl2_3_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 22990MiB 1/2 1h 15m 18s 289/410 70.49% 656716 145.23 269937 59.69
QwQ-32B-Preview (3.0bpw EXL2, max_tokens=8192) bartowski/QwQ-32B-Preview-exl2_3_0 Qwen/Qwen2.5-Coder-0.5B-Instruct 32B EXL2 TabbyAPI RTX 6000 22990MiB 2/2 1h 19m 50s 274/410 66.83% 0/2, 0.00% 656716 137.01 291818 60.88
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 - 123B EXL2 TabbyAPI RTX 6000 47068MiB 1/2 1h 26m 26s 284/410 69.27% 1/3, 33.33% 696798 134.23 79925 15.40
Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 - 123B EXL2 TabbyAPI RTX 6000 47068MiB 2/2 1h 26m 10s 275/410 67.07% 0/2, 0.00% 696798 134.67 79778 15.42
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw - 123B EXL2 TabbyAPI RTX 6000 45096MiB 1/2 1h 8m 8s 271/410 66.10% < 70.24% 696798 170.29 66670 16.29
Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw - 123B EXL2 TabbyAPI RTX 6000 45096MiB 2/2 1h 10m 38s 268/410 65.37% < 70.24% 1/3, 33.33% 696798 164.23 69182 16.31
QwQ-32B-Preview (3.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_3_0 - 32B EXL2 TabbyAPI RTX 6000 21574MiB 1/2 1h 5m 30s 268/410 65.37% 1/3, 33.33% 656716 166.95 205218 52.17
QwQ-32B-Preview (3.0bpw EXL2) bartowski/QwQ-32B-Preview-exl2_3_0 - 32B EXL2 TabbyAPI RTX 6000 21574MiB 2/2 1h 8m 44s 266/410 64.88% 656716 159.10 215616 52.24
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 - 123B EXL2 TabbyAPI RTX 6000 45096MiB 1/2 1h 11m 50s 267/410 65.12% 1/4, 25.00% 696798 161.53 70538 16.35
Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 - 123B EXL2 TabbyAPI RTX 6000 45096MiB 2/2 1h 13m 50s 243/410 59.27% 0/4, 0.00% 696798 157.18 72718 16.40
mistral-small-2409 (22B) mistralai/Mistral-Small-Instruct-2409 - 22B HF Mistral - - 1/2 25m 3s 243/410 59.27% > 53.66% 1/4, 25.00% 696798 462.38 73212 48.58
mistral-small-2409 (22B) mistralai/Mistral-Small-Instruct-2409 - 22B HF Mistral - - 2/2 20m 45s 239/410 58.29% > 53.66% 1/4, 25.00% 696798 558.10 76017 60.89
  • Model: Model name (with relevant parameter and setting details)
  • HF Main Model Name: Full name of the tested model as listed on Hugging Face
  • HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
  • Size: Parameter count
  • Format: Model format type (HF, EXL2, etc.)
  • API: Service provider (TabbyAPI indicates local deployment)
  • GPU: Graphics card used for this benchmark run
  • GPU Mem: VRAM allocated to model and configuration
  • Run: Benchmark run sequence number
  • Duration: Total runtime of benchmark
  • Total: Number of correct answers (determines ranking!)
  • %: Percentage of correct answers
  • TIGER-Lab: Comparison between the CS benchmark results reported by TIGER-Lab (the makers of MMLU-Pro) and mine
  • Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice in its response, it falls back to a random guess and reports both the number of these random guesses and how many of them happened to be correct (a high proportion of random guessing indicates problems with following the expected response format; see the sketch after this list)
  • Prompt tokens: Token count of input text
  • tk/s: Tokens processed per second
  • Completion tokens: Token count of generated response
  • tk/s: Tokens generated per second
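To illustrate what the "Correct Random Guesses" column measures, here's a simplified sketch of the kind of answer handling MMLU-Pro performs. The regular expression and function below are illustrative assumptions, not the official evaluation code:

```python
# Simplified illustration of MMLU-Pro's answer handling (pattern is
# illustrative, not copied from the official script): try to extract the
# chosen option letter from the model's response; if that fails, fall back
# to a random guess and record that it was a guess.
import random
import re

OPTIONS = "ABCDEFGHIJ"  # MMLU-Pro uses 10 answer options per question

def extract_answer(response: str) -> tuple[str, bool]:
    """Return (chosen_option, was_random_guess)."""
    match = re.search(r"answer is \(?([A-J])\)?", response)
    if match:
        return match.group(1), False
    # No recognizable answer found: the harness guesses randomly.
    # A high share of such guesses means the model ignored the format.
    return random.choice(OPTIONS), True

# Example: a truncated chain-of-thought response with no final answer
choice, guessed = extract_answer("Let me think step by step about option (C)...")
print(choice, guessed)  # e.g. "F", True
```

A response that gets truncated before stating a final answer - exactly the failure mode discussed above for QwQ with small max_tokens - ends up in this random-guess bucket.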

Speculative Decoding: Turbocharging Large Language Models

Speculative decoding represents a groundbreaking acceleration technique for LLMs that follows a "generate first, verify later" approach. This method employs a smaller, faster draft model to make preliminary token predictions, which are then validated by the main model.

The draft model quickly proposes several tokens ahead, and the main model then verifies this entire batch in a single forward pass, which is significantly cheaper than generating the same tokens one at a time.

This approach can accelerate text generation by up to 3x while maintaining output quality: because the main model only accepts draft tokens that match what it would have produced itself, the final output is the same as unassisted generation - it just arrives faster.

The system's performance heavily depends on prediction accuracy. If the acceptance rate of speculative tokens is too low, the system might actually perform slower than traditional processing. For optimal performance, the draft model should be architecturally similar to the main model to make accurate predictions, while being small enough to run quickly and fit in VRAM alongside the main model.

Think of speculative decoding as a turbocharger for AI - it significantly boosts LLM performance without compromising quality, creating a win-win situation for all LLM applications. The key is finding the right balance between draft model size and prediction accuracy - it needs to be lightweight enough to provide speed benefits while maintaining sufficient similarity to the main model for reliable predictions.
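For readers who prefer code to analogies, here's a toy sketch of the propose/verify loop for the greedy case. The draft_next and main_next callables are placeholders standing in for real model inference, and a production engine would score all draft positions in one batched forward pass rather than in a Python loop:

```python
# Toy sketch of greedy speculative decoding (placeholder model functions,
# not a real inference API): the draft model proposes k tokens, the main
# model checks the proposal, and the longest matching prefix is accepted
# before the main model supplies the next token itself.
from typing import Callable, List

def speculative_step(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model: next-token pick
    main_next: Callable[[List[int]], int],    # main model: next-token pick
    k: int = 4,
) -> List[int]:
    # 1. Draft model speculates k tokens ahead (fast, sequential).
    draft = []
    ctx = list(prompt)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Main model verifies: in a real engine all k positions are scored
    #    in a single batched forward pass; here we loop for clarity.
    accepted = []
    ctx = list(prompt)
    for t in draft:
        if main_next(ctx) == t:      # main model agrees -> accept for free
            accepted.append(t)
            ctx.append(t)
        else:                        # first disagreement -> stop and let
            break                    # the main model emit its own token
    accepted.append(main_next(ctx))
    return accepted                  # 1 to k+1 tokens per main-model "step"
```

With greedy decoding, this procedure reproduces exactly the tokens the main model would have produced on its own; the speedup comes from getting up to k+1 tokens per verification pass whenever the draft model guesses right.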

Speculative Decoding & Qwen/QwQ

I successfully increased the speed of Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) from 24.86 to 43.76 tokens per second by using Qwen2.5-Coder-0.5B-Instruct as its draft model.

And applying the same 0.5B draft model to QwQ-32B-Preview (8.0bpw EXL2) - a different model - improved its speed from 26.60 to 46.29 tokens per second.
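In both cases, that works out to roughly a 1.75x throughput gain, as a quick check against the table's completion-token speeds confirms:

```python
# Throughput gains from the results table (completion tokens per second)
baseline_coder, draft_coder = 24.86, 43.76  # Qwen2.5-Coder-32B 8.0bpw
baseline_qwq,   draft_qwq   = 26.60, 46.29  # QwQ-32B-Preview 8.0bpw

print(f"Qwen2.5-Coder speedup: {draft_coder / baseline_coder:.2f}x")  # ~1.76x
print(f"QwQ-Preview speedup:   {draft_qwq / baseline_qwq:.2f}x")      # ~1.74x
```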

No, Speculative Decoding doesn't improve the quality of the output, only the speed

Initially, I was surprised to see QwQ-32B-Preview (8.0bpw EXL2) achieving only 75% accuracy without speculative decoding, while reaching 79.02% with it. Magic? Not quite - just statistical variance! Speculative decoding only improves processing speed, not output quality. This became clear when a second benchmark run with identical settings yielded just 76.59% accuracy.

The real game-changer was adjusting the benchmark software's max_tokens parameter. Setting it to 16384 consistently improved accuracy - first to 79.27%, then to 79.02% in a second run. These stable results make perfect sense: with the higher token limit, responses weren't truncated prematurely, allowing for more reliable answer identification.
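For anyone reproducing this, the relevant knob is the max_tokens value that the benchmark script passes along with each request. The snippet below is only a sketch of such a request against a local OpenAI-compatible endpoint (TabbyAPI in my setup); the URL, API key, and model name are placeholders to adapt to your own environment:

```python
# Hedged sketch: raising max_tokens so QwQ's chain-of-thought isn't cut off.
# Endpoint URL, API key, and model name are placeholders for a local
# TabbyAPI (OpenAI-compatible) server; adapt them to your own setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="QwQ-32B-Preview-exl2_8_0",
    messages=[{"role": "user", "content": "…one MMLU-Pro CS question…"}],
    max_tokens=16384,   # MMLU-Pro's default of 2K truncates QwQ's reasoning
    temperature=0.0,
)
print(response.choices[0].message.content)
```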

While benchmarks can yield surprising results, always verify any seemingly impossible findings before jumping to conclusions - the cause may simply be statistical variance or a technical error.

Speed Demons

gpt-4o-2024-11-20 and gemini-1.5-flash-002 emerge as the clear speed leaders in this benchmark, both achieving impressive throughput of over 100 tokens per second. Notably, gemini-1.5-flash-002 was inconsistent across its two test runs - reaching peak speed in one but dropping to 59.29 tokens per second in the other.

The latest GPT-4o release (2024-11-20) reveals a fascinating trade-off: While it delivers dramatically improved speed exceeding 100 tokens per second - more than doubling its predecessor's throughput - the enhanced performance comes at a cost. The model appears to be a quantized or distilled variant, resulting in lower benchmark scores compared to the previous version.

Beyond the Benchmarks

Benchmark results show us what LLMs can do right now, but they're just one piece of the puzzle. Different real-world uses will get different results, since benchmarks can't capture every aspect of how these models actually perform.

Benchmarks are useful for comparing models, but they're just a starting point. If you find a model that looks good, try it out yourself to see how well it works for your needs.

I'm really excited about QwQ and what's coming next. It looks like these might be the first local models that can actually go toe-to-toe with the big cloud-based ones.

I've used QwQ alongside Claude 3.5 Sonnet and GPT-4 for real work projects, and QwQ did better than both in some situations. I'm really looking forward to seeing a QwQ 70B version - with a stronger foundation and further refinement, QwQ's unique approach could give us Sonnet-level performance right on our own machines. Sounds too good to be true? Maybe, but we're getting there - and probably, hopefully, faster than we think!

Closing Thoughts

This deep dive into MMLU-Pro benchmarking has revealed fascinating insights about the current state of LLMs. From the impressive performance of Claude 3.5 Sonnet and Gemini Pro to the speed demons like GPT-4o-2024-11-20, each model brings its own strengths to the table.

We've seen how inference techniques like speculative decoding can dramatically improve speed without sacrificing quality, and how careful parameter tuning (like adjusting max_tokens) can significantly impact results. The trade-offs between speed and accuracy, particularly evident in the latest GPT-4o release, highlight the complex balancing act in LLM development.

But most excitingly, we're witnessing the rise of powerful local models like QwQ that can compete with or even outperform much larger models. This democratization of AI capabilities is a promising trend for the future of accessible, high-performance language models.

As we continue to push the boundaries of what's possible with LLMs, these benchmarks serve as valuable waypoints in our journey toward more capable, efficient, and accessible AI systems. The rapid pace of innovation in this field ensures that what seems impressive today may be baseline tomorrow - and I, for one, can't wait to see what comes next.

Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.