MMLU-Pro benchmark

#13 opened by kth8

In Meta's announcement I noticed they showed MMLU scores for the 1B and 3B models but not MMLU-Pro. Here are my testing results, with Llama 3.1 8B and Qwen2.5 3B included for comparison:

| Models                | Data Source   | Overall | Biology | Business | Chemistry | Computer Science | Economics | Engineering | Health  | History | Law   | Math  | Philosophy | Physics | Psychology | Other |
|-----------------------|---------------|---------|---------|----------|-----------|------------------|-----------|-------------|---------|---------|-------|-------|------------|---------|------------|-------|
| Llama-3.1-8B-Instruct | TIGER-Lab     | 0.443   | 0.630   | 0.493    | 0.376     | 0.483            | 0.551     | 0.297       | 0.507   | 0.423   | 0.273 | 0.438 | 0.445      | 0.403   | 0.600      | 0.448 |
| Qwen2.5-3B            | Self-Reported | 0.437   | 0.545   | 0.541    | 0.407     | 0.432            | 0.530     | 0.292       | 0.440   | 0.391   | 0.223 | 0.545 | 0.371      | 0.440   | 0.555      | 0.415 |
| Llama-3.2-3B-Instruct | Self-Reported | 0.365   | 0.552   | 0.399    | 0.264     | 0.371            | 0.480     | 0.260       | 0.461   | 0.336   | 0.227 | 0.378 | 0.349      | 0.302   | 0.514      | 0.358 |

You can view the full leaderboard here: https://huggingface.co./spaces/TIGER-Lab/MMLU-Pro
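
For context on how these multiple-choice scores are produced, here is a rough, simplified sketch of scoring a single MMLU-Pro-style item (up to 10 options, with the chosen letter extracted from the model's reply). This is not the official TIGER-Lab harness, which, as I understand it, uses few-shot chain-of-thought prompts and more careful answer extraction; the example question is made up purely for illustration.

```python
import re

def format_question(question: str, options: list[str]) -> str:
    """Build an MMLU-Pro-style prompt with up to 10 lettered options."""
    letters = "ABCDEFGHIJ"
    lines = [f"Question: {question}", "Options:"]
    for letter, option in zip(letters, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone option letter out of the model's reply (crude on purpose)."""
    match = re.search(r"\b([A-J])\b", model_output)
    return match.group(1) if match else None

# Hypothetical item, just to show the flow:
prompt = format_question(
    "Which planet has the highest average surface temperature?",
    ["Mercury", "Venus", "Earth", "Mars", "Jupiter",
     "Saturn", "Uranus", "Neptune", "Pluto", "Ceres"],
)
print(prompt)
print(extract_choice("The correct answer is B."))  # -> "B"
```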

@kth8 The Qwen2.5 series has absurdly low world knowledge for its parameter counts, even compared to the Qwen2 series.

For example, Qwen2.5 72b scored 68.4/100, vs 85.9 for Qwen2 72b, on my easy world knowledge test (no esoteric questions, only the top 5% most popular world knowledge, and no tricks like deliberate obfuscation, just naturally worded, real-world 0-shot questions).

Anybody can filter out or under-train on the bulk of humanity's most popular knowledge in order to focus on STEM tokens and boost MMLU scores at a given parameter count. This isn't something that should be encouraged.

Additionally, even when it comes to STEM knowledge, Qwen2.5 (but not Qwen2) performs well below its reported MMLU scores during real-world Q&A, so something isn't right.

I suspect this has less to do with test contamination and more to do with the difference between asking free-form, real-world questions and multiple-choice questions. For example, if you under-train on a huge 18T corpus, the model is less likely to accurately retrieve information in response to 0-shot real-world questions (resulting in a spike in hallucinations, spelling errors...), but on a multiple-choice test like the MMLU, where the correct answer is one of the provided options, fuzzy recollection is enough to select the nearest match and get a few more questions right.

I'm >95% confident this is what's happening with Qwen2.5. Its world knowledge went way down compared to Qwen2, Llama, Mistral... at the same parameter count when asked real-world questions, and it even went down on advanced STEM knowledge (more hallucinations and inaccuracies), but the broader, weakly held knowledge from its giant, under-trained 18T corpus still allowed it to make a few more nearest-match best guesses on the MMLU and push its score higher.

In short, the Qwen2.5 series, including Qwen2.5 3b, is profoundly ignorant compared to other LLMs of comparable size, including Qwen2, and, to a lesser degree, this even extends to real-world retrieval of STEM knowledge despite the higher MMLU scores.
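
To make the multiple-choice-vs-free-form point concrete, here is a rough sketch (not any benchmark's official scoring code) that scores each provided option by its log-probability under the model and then asks the same question free-form. With the options listed, the model only has to rank candidates, so a fuzzy, nearest-match recollection is enough; free-form, it has to produce the fact itself. The model name and question below are just placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # placeholder; any causal LM should work
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "What hormone's natural morning rise makes you more alert after waking?"
options = ["Cortisol", "Melatonin", "Insulin", "Oxytocin"]

def option_logprob(question: str, option: str) -> float:
    """Average log-probability of the option's tokens given the question (multiple-choice style)."""
    prompt_ids = tok(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids
    option_ids = tok(" " + option, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)       # predictions for positions 1..L-1
    option_preds = logprobs[:, prompt_ids.shape[1] - 1 :, :]   # slots that predict the option tokens
    token_lp = option_preds.gather(-1, option_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

# Multiple-choice style: the model only has to prefer the right option over the others.
scores = {opt: option_logprob(question, opt) for opt in options}
print("Multiple-choice pick:", max(scores, key=scores.get))

# Free-form 0-shot: the model has to retrieve and state the fact on its own.
inputs = tok(f"Question: {question}\nAnswer:", return_tensors="pt")
gen_ids = model.generate(**inputs, max_new_tokens=30)
print("Free-form answer:", tok.decode(gen_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```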

Not that you care/asked, but I just ran my top 5% popular world knowledge test (mentioned in my last comment) on Qwen2.5 3b and it scored 28.8/100, compared to 62.1 for Llama 3.2 3b. This is an overwhelming difference. Even Qwen2 1.5b got a 25.8.

And it's more than just profoundly ignorant. It had a much harder time following instructions than Llama 3.2 3b, for example listing both the actors and the characters when instructed to list only the actor names. But I still gave it a second chance and graded those results, which I didn't do for other LLMs, so in effect it did even worse than the 28.8 suggests.

@phil111 Thanks for sharing your independent testing results. I agree that doing alternative testing outside the industry-standard benchmarks is important for identifying weaknesses in a model.

Providing 10 options instead of 4 helps a lot, but you're still handing the model the real answer to jog its memory rather than forcing it to accurately retrieve the information. LLM evaluation needs to transition to non-multiple-choice, 0-shot questions ASAP.

There's simply no way Qwen2.5 3b is scoring above or anywhere near LLMs like Mixtral 8x7b on the MMLU. The same goes for Qwen2.5 32b & 72b scoring above Gemini 1.5 Pro and Claude 3 Opus, and nearly matching GPT4o.

I'll paste 3 sample questions showing how normal people try to retrieve STEM info from LLMs, along with Mixtral 8x7b's and Qwen2.5 3b's responses, in the subsequent comment. Again, we need to stop using multi-shot multiple-choice questions to evaluate LLMs ASAP. Something is way off.

Promised STEM question examples (Mixtral 8x7b vs Qwen2.5 3b). The wording was carefully chosen, and each answer is specific (e.g. CAR vs the circadian rhythm in general).

1. In astronomy, what's the name of the hypothetical object that forms when a neutron star merges with a red supergiant star, potentially forming a black hole without a supernova explosion?

Mixtral: Thorne–Żytkow object (correct)
Qwen2.5 3b: hypernova (wrong)

2. What is the phenomenon called that makes you more alert in the morning due to the natural rise in a hormone level?

Mixtral: CAR (Cortisol Awakening Response) (correct)
Qwen2.5 3b: circadian rhythm (no mention of the specifically asked-for CAR) (wrong)

3. What is the condition called when blood sugar briefly falls too far after eating?

Mixtral: Reactive hypoglycemia (correct)
Qwen2.5 3b: hypoglycemia (wrong; it completely ignores the postprandial/after-eating part, and the dip in blood sugar goes without saying)

Anyway, these aren't cherry-picked. Qwen2.5 3b is reliably way off, or only returns a nearest match that wasn't specifically asked for, while other models like Mixtral get most of them right, and large proprietary models like GPT4o, Gemini, and Sonnet get ~100% right. The answers are all obvious to people in the know and to strong LLMs, yet even the larger Qwen2.5 models (32b/72b) get about half of them wrong. Again, there's something fundamentally wrong with multiple-choice LLM testing that allows very weak models to score on par with, or even higher than, vastly more powerful models.
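
For what it's worth, here's a minimal sketch of the kind of non-multiple-choice, 0-shot check being argued for here: ask the three questions above in plain language and grade the free-form reply by whether it names the specific answer. The keyword matching is a crude stand-in for the human grading described in this thread, and the model name is just a placeholder.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-3B-Instruct")  # placeholder model

items = [
    ("In astronomy, what's the name of the hypothetical object that forms when a "
     "neutron star merges with a red supergiant star, potentially forming a black "
     "hole without a supernova explosion?", ["thorne", "zytkow", "żytkow"]),
    ("What is the phenomenon called that makes you more alert in the morning due "
     "to the natural rise in a hormone level?", ["cortisol awakening", "awakening response"]),
    ("What is the condition called when blood sugar briefly falls too far after "
     "eating?", ["reactive hypoglycemia", "postprandial hypoglycemia"]),
]

score = 0
for question, keywords in items:
    # Wrap the question in the model's chat template and generate a free-form answer.
    prompt = generator.tokenizer.apply_chat_template(
        [{"role": "user", "content": question}], tokenize=False, add_generation_prompt=True
    )
    reply = generator(prompt, max_new_tokens=80, return_full_text=False)[0]["generated_text"]
    correct = any(k in reply.lower() for k in keywords)
    score += correct
    print(f"{'PASS' if correct else 'FAIL'}: {reply.strip()[:80]}")

print(f"0-shot free-form score: {score}/{len(items)}")
```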
