i have proof that show the evals shouldn't be trusted

#116
by breadlicker45 - opened

Seen here are my models, Bread AI. They are around 160-200M parameters.

PM_modelv2 (a badly fine-tuned prompt-making model trained on an M40 GPU I own) supposedly beats a 60B model. That is clearly false.

musePY is basically a random number generator.
(screenshot: image.png)

breadlicker45 changed discussion title from i have proof that show the evals are wrong and shouldn to i have proof that show the evals shouldn't be trusted

Lol. Also, ehartford/WizardLM-30B-Uncensored has about 60 points while WizardLM/WizardLM-30B-V1.0 has only 30. I know the two models are different, but such a huge performance gap hardly seems possible.
Several weeks ago the HF team did say they were rewriting code to fix previous eval errors, or something along those lines.

Open LLM Leaderboard org

Hi @zmcmcc and @breadlicker45 !
We have re-run all models to use the new fixed MMLU evals from the Harness, and are currently re-running some Llama scores.

Did you know that you can actually reproduce our results by launching the commands in the About section? If there are models you feel unsure about, feel free to double-check them by re-running the evals and sharing your results with us!
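
For reference, here is a rough sketch of what re-running one leaderboard-style task looks like through the harness's Python entry point. This is an assumption based on the 2023-era lm-evaluation-harness API rather than the exact About-section commands; backend names, task names, and few-shot settings may differ in your harness version.

```python
# Hedged sketch: re-running one leaderboard-style eval with the
# EleutherAI lm-evaluation-harness (2023-era API). Backend name, task
# name and few-shot count below are assumptions and may differ in
# other harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                            # Hugging Face causal-LM backend
    model_args="pretrained=BreadAI/PM_model_V2",  # model id as discussed in this thread
    tasks=["arc_challenge"],                      # ARC-Challenge, one leaderboard task
    num_fewshot=25,                               # the leaderboard runs ARC 25-shot
    batch_size=2,
)

# Per-task metrics, e.g. acc / acc_norm for ARC.
print(results["results"])
```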

Sure, run musePY and just use your own judgment instead of the evals; you can see the scores are wrong.

The backend is https://github.com/EleutherAI/lm-evaluation-harness (confirmed by Stella).

Open LLM Leaderboard org
edited Jul 19, 2023

Hi!
Just checked your model's results (BreadAI/PM_model_V2) in more depth, and you actually get random scores on all evals (around 25%, the random baseline), except for TruthfulQA, which has a slightly unbalanced answer distribution; I suspect your random generator got lucky there!
A good way of pointing out that evals, especially at scores this low, do not tell the whole story :)
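
As a toy illustration of the random-baseline point (made-up numbers, not the actual TruthfulQA scoring, which is more involved):

```python
# Toy simulation: on a balanced 4-way multiple-choice task, uniform
# guessing sits near the 25% baseline, but on a slightly unbalanced
# task a degenerate "always pick the same option" model can land above
# 25% purely by luck. The answer-distribution weights are made up.
import random

random.seed(0)
n = 10_000

# Balanced task: the correct option is uniform over 4 choices.
balanced_answers = [random.randrange(4) for _ in range(n)]
random_guesses = [random.randrange(4) for _ in range(n)]
acc_balanced = sum(g == a for g, a in zip(random_guesses, balanced_answers)) / n
print(f"random guessing, balanced answers: {acc_balanced:.1%}")   # ~25%

# Slightly unbalanced task: option 0 is correct 35% of the time.
skewed_answers = random.choices(range(4), weights=[35, 22, 22, 21], k=n)
acc_degenerate = sum(a == 0 for a in skewed_answers) / n
print(f"'always option 0', skewed answers: {acc_degenerate:.1%}")  # ~35%
```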

clefourrier changed discussion status to closed
