open-llm-leaderboard/open_llm_leaderboard · I can't replicate results.

Nov 12, 2024

edited Nov 12, 2024

My recently benchmarked model, OpenChat-3.5-0106_32K-PoSE, scored very badly on the leaderboard. It is just a version of Openchat-3.5-0106 with its context extended using PoSE and then fine-tuned. At first I accepted that I had probably messed up the fine-tuning. But after reviewing everything, and given how small the dataset is (LongAlpaca-12k), I couldn't see how every score ended up at roughly half the base model's. So I decided to run the same benchmarks locally using Hugging Face's fork of lm_eval. Because of my limited compute and time, I have only retested the leaderboard's IFEval, MMLU-Pro and MUSR tasks, and the results are nearly identical to the base model's, unlike the terrible results the model got here. Maybe there was a problem with the version that was deployed when my model was tested? It might be worth checking, since more models could have been affected. I thought about posting this as an issue on the GitHub repo, but since the latest version there gives me results close to what I expected, my thinking is that the problem might be in the live version.

In case it helps to narrow down the problem and version of lm_eval that was running, here are the request and result files:

Best Regards,
@Pretergeek

clefourrier

Open LLM Leaderboard org Nov 12, 2024

edited Nov 12, 2024

Hi @Pretergeek ,

Thanks for the issue! Did you read our FAQ? I ask notably because we provide the full command to use to reproduce our results, and it's important to compare scores with the same normalization (we have a whole page on this).
Btw, this could be a good use case for our comparator tool here: https://huggingface.co/spaces/open-llm-leaderboard/comparator. You'll be able to compare precisely your fine tune with its base :)
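
For readers wondering what "the same normalization" means in practice, here is a minimal sketch. It assumes normalized scores are raw scores rescaled so that the random-guess baseline maps to 0 and a perfect score maps to 100. The function name, the 0.1 baseline (one correct answer out of ten choices) and the raw accuracy are illustrative assumptions, not values taken from this thread.

```python
def normalize_within_range(value: float, lower_bound: float = 0.0,
                           higher_bound: float = 1.0) -> float:
    """Rescale a raw score so lower_bound maps to 0 and higher_bound to 100.

    Scores below the baseline are clipped to 0.
    """
    clipped = max(value - lower_bound, 0.0)
    return clipped / (higher_bound - lower_bound) * 100.0


# Placeholder raw accuracy on a 10-choice task (assumed baseline: 1/10).
raw_acc = 0.33
normalized = normalize_within_range(raw_acc, lower_bound=0.1)
print(f"raw={raw_acc:.2f} normalized={normalized:.2f}")
```

A raw accuracy and its normalized counterpart can look very different, so a locally computed raw score should only be compared against the leaderboard's raw score, or both should be normalized the same way.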

Pretergeek

Nov 13, 2024

Hi @clefourrier ,

I have read the FAQ, yes. I believe that is how I learned that Hugging Face had its own fork of lm_eval on GitHub. I cloned the repository and, after seeing the amount of work you folks have been putting into it lately, I thought it was a good idea to try it. I installed it according to the FAQ instructions. As I mentioned, I didn't run all the leaderboard tasks locally, but I ran the leaderboard's versions of MMLU_PRO, MUSR and IFEval. I ran them with some added parameters, like:

lm_eval --model hf --model_args pretrained=Pretergeek/OpenChat-3.5-0106_32K-PoSE,dtype=bfloat16,parallelize=True --batch_size auto --device cuda --log_samples --apply_chat_template --output_path OpenChat-3.5-0106_32K-PoSE_LongAlpaca/ --tasks=leaderboard_mmlu_pro

The results from my local runs are in line with the results of the base model, openchat/openchat-3.5-0106, and also in line with the results of my previous 7 models that are modifications of that same base model (although those modifications are of a different nature: they are upscaled versions of it). I have also run the same tasks locally on the base model and got results that match the ones recorded here on the leaderboard for that model, so I believe that validates my local tests with lm_eval. That is why I found the results here on the live leaderboard unexpected. I didn't try the model comparator because, although I am familiar with it and have used it in the past, my understanding is that it does the comparison using the same result dataset that this leaderboard space uses, and those results are the ones I am unsure about.
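
A minimal sketch of the kind of check described above: comparing per-task scores from a local run against a reference run and flagging gaps larger than a tolerance. The function name, the tolerance and every number below are made-up placeholders, not scores from this thread, and the task names are only meant as examples.

```python
from typing import Dict


def compare_scores(local: Dict[str, float], reference: Dict[str, float],
                   tolerance: float = 0.01) -> Dict[str, float]:
    """Return tasks whose scores differ by more than `tolerance`.

    The value for each flagged task is the signed difference (local - reference).
    Tasks that appear in only one of the two runs are ignored.
    """
    diffs = {}
    for task in sorted(set(local) & set(reference)):
        delta = local[task] - reference[task]
        if abs(delta) > tolerance:
            diffs[task] = delta
    return diffs


# Placeholder scores only, chosen to show one matching and one diverging task.
local_run = {"leaderboard_mmlu_pro": 0.30, "leaderboard_ifeval": 0.55}
reference_run = {"leaderboard_mmlu_pro": 0.15, "leaderboard_ifeval": 0.545}
print(compare_scores(local_run, reference_run))
```

Both inputs need to be the same kind of score (both raw or both normalized) for the differences to mean anything.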

Thank you in advance,
@Pretergeek

clefourrier

Open LLM Leaderboard org Nov 13, 2024

Hi!
Thanks for taking a look at the readme. You are getting different results because you are not running the same command as we are. You should be running with --apply_chat_template and --fewshot_as_multiturn, or neither.
Can you try with the correct command and tell us what you get?
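
The "both or neither" rule above can be checked mechanically before launching a long run. This is only a sketch: the helper name is hypothetical, it looks for the two flags as bare tokens, and the command string is a shortened placeholder.

```python
import shlex
from typing import List

# The two flags that should be passed together, or not at all.
CHAT_FLAGS = ("--apply_chat_template", "--fewshot_as_multiturn")


def chat_flags_consistent(argv: List[str]) -> bool:
    """Return True if both chat flags are present, or if neither is."""
    present = [flag in argv for flag in CHAT_FLAGS]
    return all(present) or not any(present)


# Placeholder command with only one of the two flags.
command = ("lm_eval --model hf --model_args pretrained=<your_model> "
           "--apply_chat_template --tasks=leaderboard_mmlu_pro")
print(chat_flags_consistent(shlex.split(command)))  # prints False
```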

Using the comparator tool would allow you to understand in more detail where the discrepancy comes from (for example, by comparing the openchat predictions with your model's).

Pretergeek

Nov 13, 2024

My bad. According to GitHub, the "--fewshot_as_multiturn" parameter was added to the documentation (docs/source/en/open_llm_leaderboard/about.md) on September 23, just a few days after I first cloned the repo. I should have re-read it when I made the last git pull. The install instructions in the doc are also different from the ones in GitHub's README.md (those are probably the same as in the original repo from EleutherAI). I am going to clone the repo again, reinstall using the current instructions, add the new parameter to the script and retry the evaluations.

Pretergeek changed discussion title from Possibly innacurate results. to I can't replicate results. Nov 14, 2024

Pretergeek

Nov 14, 2024

edited Nov 17, 2024

I just changed the title to reflect the fact that I am trying to replicate the results, and because the original title sounded like there was a problem with the leaderboard when I can't say for sure that the problem is not on my side.

Anyway, I made a clean install of the GitHub repo, following the instructions in the documentation here.

git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout adding_all_changess
pip install -e .[math,ifeval,sentencepiece]
lm-eval --model_args="pretrained=<your_model>,revision=<your_model_revision>,dtype=<model_dtype>" --tasks=leaderboard  --batch_size=auto --output_path=<output_path>

Unfortunately, that commit doesn't seem to work with either parallelize=True or accelerate launch, and my humble GPUs can't do the work solo. I did create an accelerate config, the same one I always use locally, and I still get this exception:

Traceback (most recent call last):
  File "/home/duda/repos/lm-evaluation-harness/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/duda/repos/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/duda/repos/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/home/duda/repos/lm-evaluation-harness/lm_eval/evaluator.py", line 198, in simple_evaluate
    lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
  File "/home/duda/repos/lm-evaluation-harness/lm_eval/api/model.py", line 148, in create_from_arg_string
    return cls(**args, **args2)
  File "/home/duda/repos/lm-evaluation-harness/lm_eval/models/huggingface.py", line 225, in __init__
    self._create_model(
  File "/home/duda/repos/lm-evaluation-harness/lm_eval/models/huggingface.py", line 634, in _create_model
    self._get_accelerate_args(
  File "/home/duda/repos/lm-evaluation-harness/lm_eval/models/huggingface.py", line 376, in _get_accelerate_args
    if gpus > 1 and num_machines == 0:
TypeError: '>' not supported between instances of 'NoneType' and 'int'

It seems I have null GPUs. :)
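
For anyone who hits the same message, here is a minimal sketch of what the last frame of the traceback is complaining about: in Python 3, comparing None with an int raises exactly this TypeError. The guarded function below is a hypothetical stand-in that treats a missing value as 0. It is not the code from lm_eval and not a proposed patch for it.

```python
from typing import Optional


def reproduce_error() -> str:
    """Trigger the same kind of comparison and return the error message."""
    gpus = None  # stands in for a GPU count that was never set
    try:
        gpus > 1
    except TypeError as exc:
        return str(exc)
    return ""


def multi_gpu_single_process(gpus: Optional[int],
                             num_machines: Optional[int]) -> bool:
    """Hypothetical guarded version of the check: None is treated as 0."""
    gpus = gpus or 0
    num_machines = num_machines or 0
    return gpus > 1 and num_machines == 0


print(reproduce_error())
print(multi_gpu_single_process(None, 0))
```

Whether falling back to 0 is the right behaviour depends on how the GPU count is supposed to be obtained, which this sketch does not try to answer.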

I could switch to the latest commit, but that is the one I used for the previous tests, which might be incorrect, and that commit seems to be marked as failing on GitHub:

So I think I am going to wait for a new stable release before trying to replicate the evaluations. I am a bit limited on time and resources.

Thank you for all the help in diagnosing the problem. I appreciate you taking the time for that.
If possible, I would appreciate it if we could leave this discussion open for a while, until I am able to redo the evaluations or in case someone else finds discrepancies in the results.

Best Regards,
@Pretergeek