Feature Request: change request file format to disambiguate chat and non-chat models?

#954
by CombinHorizon - opened

For example, instead of:
ModelName-SizeB_eval_request_False_bfloat16_Original.json
perhaps:

ModelName-SizeB_eval_request_False_bfloat16_ChatOn_Original.json
ModelName-SizeB_eval_request_False_bfloat16_ChatOff_Original.json

  • so that they don't overwrite each other on reruns;
    also, the scores seem to differ significantly depending on this setting (esp. for some models)
  • so that chat-on and chat-off entries are listed separately on the leaderboard (a sketch of the proposed naming follows this list)
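A minimal sketch of the proposed naming in Python — the ChatOn/ChatOff token and the helper itself are just illustrations of the suggestion above, not an existing leaderboard convention, and the boolean field is kept opaque (whatever it currently encodes):

```python
def request_filename(base: str, precision: str, use_chat_template: bool,
                     flag: str = "False", weight_type: str = "Original") -> str:
    """Build a request filename with an explicit chat token, so that
    chat-on and chat-off submissions land in different files."""
    chat = "ChatOn" if use_chat_template else "ChatOff"
    return f"{base}_eval_request_{flag}_{precision}_{chat}_{weight_type}.json"

print(request_filename("ModelName-SizeB", "bfloat16", use_chat_template=True))
# ModelName-SizeB_eval_request_False_bfloat16_ChatOn_Original.json
```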

From what I'm seeing with some similar models, the chat template appears to affect IFEval scores (⇈) and MUSR too (⇊) — but by how much?

If this is to be updated, maybe look into the request files' commit history, and also the multiple result files (which don't overwrite each other), to help disambiguate and sort things out.

The chat template's effect on scores seems to be more significant than that of bfloat16 vs. float16.

Question: what determines which chat template will be used? What file or process (e.g. generation_config.json), and what other assumptions or defaults apply?
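For context (and not specific to the leaderboard's internal pipeline): in the transformers library, the chat template is a Jinja string that ships with the tokenizer — typically the chat_template field of tokenizer_config.json rather than generation_config.json — and it is applied with tokenizer.apply_chat_template. A minimal sketch, with the model id as an arbitrary example:

```python
from transformers import AutoTokenizer

# The template travels with the tokenizer (chat_template in tokenizer_config.json).
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [{"role": "user", "content": "Hello!"}]
prompt = tok.apply_chat_template(messages, tokenize=False,
                                 add_generation_prompt=True)
print(prompt)                          # the fully formatted prompt string
print(tok.chat_template is not None)  # True when the repo defines a template
```

Whether this template gets applied at evaluation time is exactly what the chat on/off toggle controls, so it directly changes the prompt the model sees.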

Open LLM Leaderboard org

Hi @CombinHorizon ,

Thank you for your suggestion!

We agree that this modification can help compare a model with and without the chat template. We're actually in the process of revamping our request naming system, as some current parameters are no longer relevant.

We'll come back to you as soon as we have decided on a new simpler format!

Question: does the leaderboard's Chat Template column use info from the request file or the result file?
If it gets its data from the request file, consider the edge case where a model

  • First, is submitted and completed successfully with one chat setting
  • After that, the model is updated
  • Then, a request is submitted with the opposite chat toggle setting
  • And lastly, the request fails

If that were the case, wouldn't it display the wrong info in that column (since the request file is overwritten)? It's not a major issue, but it's something to keep in mind — and it's why referencing commit history matters for accurate accounting.
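For the accounting idea, a minimal sketch with huggingface_hub — the repo id is illustrative, and matching a model name in commit titles is just a heuristic (the real requests dataset and its commit messages may differ):

```python
from huggingface_hub import HfApi

api = HfApi()
# Illustrative repo id for the requests dataset.
commits = api.list_repo_commits("open-llm-leaderboard/requests",
                                repo_type="dataset")
for c in commits:
    # Each commit carries a timestamp and title, enough to reconstruct
    # the order in which a given request file was (re)written.
    if "ModelName" in c.title:  # heuristic per-model filter
        print(c.created_at, c.title)
```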

Open LLM Leaderboard org

Hi @CombinHorizon ,

Yes, we use the info from the request files to create the Chat Template column, and there is no such problem at the moment, as the parser is restarted every few hours and checks the request files anew each time. Plus, we don't parse failed models.
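In other words, the described behavior is roughly the following — a sketch only; the field names status, FINISHED, and use_chat_template are assumptions about the request JSON schema, not confirmed leaderboard code:

```python
import json
from pathlib import Path

def parse_chat_template_column(requests_dir: str) -> dict[str, bool]:
    """Re-scan every request file from scratch, keeping only finished
    runs, so a later failed request can't poison the column."""
    column: dict[str, bool] = {}
    for path in Path(requests_dir).rglob("*_eval_request_*.json"):
        req = json.loads(path.read_text())
        if req.get("status") != "FINISHED":  # skip failed/pending runs
            continue
        column[req["model"]] = bool(req.get("use_chat_template", False))
    return column
```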

On the effect of the presence or absence of a chat template on scores (using the new UI, 14B models; it's more readable):
[Screenshot: leaderboard comparison of 14B models with and without chat templates]

Could a second, normalized score — relative to similar models with the same size/settings/class — be used (other than rank) to help compare similar models? (See the sketch below.)
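One concrete form this could take is a z-score within each peer group (same size class, precision, and chat setting). This is just a sketch of the idea, with made-up numbers:

```python
from statistics import mean, stdev

def normalized_scores(scores: dict[str, float]) -> dict[str, float]:
    """Z-score each model against its peer group:
    0 = class average, +1 = one standard deviation above it."""
    mu, sigma = mean(scores.values()), stdev(scores.values())
    return {m: (s - mu) / sigma for m, s in scores.items()}

# Peers: models of the same size/settings/class (illustrative data).
peers = {"model-a": 41.2, "model-b": 38.7, "model-c": 44.9}
print(normalized_scores(peers))
```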

There's also the question of chat-template settings: they affect the scores, so what would be a good way to mitigate this and make models easier to compare with each other in detail?
E.g. a way to re-submit and view scores with the opposite chat setting (esp. for selected well-performing models),
and perhaps more advanced graphing/filtering interfaces?

See: other recent discussions about the chat template affecting scores.
