Question about the benchmarks
Hi,
I'm interested in understanding the benchmarking methodology used to compare your AI models with those from other companies and teams, specifically with regard to the lm-evaluation-harness framework.
For example, I've noticed that the reported MMLU and MMLU-PRO scores for Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct appear lower than expected (and also lower than what Meta and Qwen themselves report).
Could you provide more details on the settings or configuration used for these benchmarks? I'd like to make sure the comparisons are accurate. Thank you.
Hi,
Some details are already given in the blog post: https://huggingface.co./blog/falcon3#:~:text=In%20our%20internal%20evaluation%20pipeline%3A
Hi, thank you. Yes, I did read that before posting, but unfortunately it only provides this one detail:
We report raw scores obtained by applying chat template without fewshot_as_multiturn (unlike Llama3.1).
Part of why I'm asking is that the official open_llm_leaderboard, which is powered by the same lm-evaluation-harness, reports these results on MMLU-PRO:
As you can see, Falcon3-3B-Instruct is slightly outscored here by both models on MMLU-PRO. However, according to your readme, the results are very different:
So I'm just trying to understand what caused this big discrepancy between the scores?
The difference is in "We report raw scores obtained by applying chat template without fewshot_as_multiturn (unlike Llama3.1)":
- we use raw scores, whereas the HF leaderboard uses normalized scores
- --fewshot_as_multiturn is not enabled in our evals, whereas it is enabled in the HF leaderboard evals (see the sketch below)
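To make the two differences concrete, here's a rough sketch using the lm-evaluation-harness Python API. It assumes a recent version (>= 0.4.3) where simple_evaluate exposes the apply_chat_template and fewshot_as_multiturn switches; the model args, few-shot count, task name, and the normalization helper reflect my reading of the two setups rather than an exact reproduction of either pipeline.

```python
# Rough sketch of the two setups, assuming lm-evaluation-harness >= 0.4.3
# (where simple_evaluate exposes apply_chat_template / fewshot_as_multiturn).
# Model args, few-shot count, and task name are illustrative only.
import lm_eval

MODEL_ARGS = "pretrained=tiiuae/Falcon3-3B-Instruct,dtype=bfloat16"
TASKS = ["leaderboard_mmlu_pro"]  # Open LLM Leaderboard task config (assumed)

# Falcon3-style run: chat template applied, few-shot examples kept in a
# single turn (fewshot_as_multiturn off), raw accuracy reported as-is.
falcon_style = lm_eval.simple_evaluate(
    model="hf",
    model_args=MODEL_ARGS,
    tasks=TASKS,
    num_fewshot=5,
    apply_chat_template=True,
    fewshot_as_multiturn=False,
)

# Leaderboard-style run: same template, but each few-shot example is
# rendered as its own user/assistant turn.
leaderboard_style = lm_eval.simple_evaluate(
    model="hf",
    model_args=MODEL_ARGS,
    tasks=TASKS,
    num_fewshot=5,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)

def normalized(raw_acc: float, num_choices: int = 10) -> float:
    """Rescale raw accuracy so the random baseline maps to 0, which is my
    understanding of how the leaderboard normalizes scores. MMLU-PRO has
    10 options per question, so the baseline is 0.1."""
    baseline = 1.0 / num_choices
    return max(0.0, (raw_acc - baseline) / (1.0 - baseline))
```

So a raw MMLU-PRO accuracy of, say, 30% would display as roughly 22% once normalized, which is why the readme numbers and the leaderboard numbers are not directly comparable.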
Got it, so does that mean the few-shot setting made all the difference, given that even the raw scores show Falcon 3B scoring slightly lower?
I'm just curious whether this is an accurate reflection when, for example, the Falcon readme shows a 59.9% decrease in MMLU-PRO score between Llama-3.2-3B-Instruct and Falcon3-3B-Instruct.
Here's the raw score reported by the leaderboard: