Feature Request: change the request file format to disambiguate chat and non-chat models?
for example, instead of:
ModelName-SizeB_eval_request_False_bfloat16_Original.json
perhaps:
ModelName-SizeB_eval_request_False_bfloat16_ChatOn_Original.json
ModelName-SizeB_eval_request_False_bfloat16_ChatOff_Original.json
- so that they don't overwrite each other on reruns
- and so that chat-on and chat-off entries are listed separately on the leaderboard, since the scores (especially for some models) differ significantly depending on this setting
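A minimal sketch of what such a naming scheme could look like; the field order mirrors the current filenames, but the function and flag names are purely illustrative, not the leaderboard's actual code:

```python
# Hypothetical request-file naming scheme that encodes the chat-template
# toggle, so ChatOn/ChatOff runs map to distinct files and never overwrite
# each other. All names here are illustrative assumptions.

def request_filename(model: str, precision: str, use_chat_template: bool,
                     weight_type: str = "Original") -> str:
    chat_flag = "ChatOn" if use_chat_template else "ChatOff"
    return f"{model}_eval_request_False_{precision}_{chat_flag}_{weight_type}.json"

# The two configurations now produce two different filenames:
print(request_filename("ModelName-SizeB", "bfloat16", True))
print(request_filename("ModelName-SizeB", "bfloat16", False))
```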
From what I'm seeing with some similar models, the chat template appears to affect IFEval scores, and MUSR too (but by how much?).
If this is to be updated, it may help to look into the request files' commit history, as well as the multiple result files (which don't overwrite each other), to disambiguate and sort things out.
The chat template's effect on scores seems to be more significant than the difference between bfloat16 and float16.
Question: what determines which chat template is used? Which file or process (e.g. generation_config.json)? What else, or what other assumptions/defaults apply?
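For reference, in the transformers library the chat template is typically read from the `chat_template` key of the repo's tokenizer_config.json (not generation_config.json), and the tokenizer applies it via `apply_chat_template`. A minimal sketch of that lookup; the helper below is illustrative, and fallback behavior varies by transformers version:

```python
# Sketch: where a chat template usually comes from. The `chat_template`
# field of tokenizer_config.json holds a Jinja template string; if it is
# absent, the library falls back to a default or raises, depending on
# the transformers version. Illustrative only.
import json
from typing import Optional

def resolve_chat_template(tokenizer_config: dict) -> Optional[str]:
    # Return the repo-provided template if one is shipped, else None.
    return tokenizer_config.get("chat_template")

cfg = json.loads('{"chat_template": "{% for m in messages %}...{% endfor %}"}')
print(resolve_chat_template(cfg))
print(resolve_chat_template({}))  # no template shipped -> None
```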
Hi @CombinHorizon ,
Thank you for your suggestion!
We agree that this modification would help compare a model with and without the chat template. We're actually in the process of revamping our request naming system, as some current parameters are no longer relevant.
We'll come back to you as soon as we have decided on a new simpler format!
Question: does the leaderboard's Chat Template column use info from the request file or the result file?
If it gets its data from the request file, consider the edge case where a model:
- is first submitted and completes successfully with one chat setting,
- is then updated,
- then a request is submitted with the opposite chat toggle setting,
- and lastly, that request fails.
In that case, wouldn't the column display the wrong info (since the request file is overwritten)? It's not a major issue, but something to keep in mind; that's why referencing the commit history matters for accurate accounting.
Hi @CombinHorizon ,
Yes, we use the info from request files to populate the Chat Template column, and there is no such problem at present: the parser is restarted every few hours and checks the request files anew each time. Plus, we don't parse failed models.
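The behavior described above can be sketched as a simple filter over the request files on each parser pass; the `status` field name and its values are assumptions about the request JSON schema, not confirmed internals:

```python
# Hedged sketch of the described parser behavior: on each pass, re-read
# every request file and keep only those marked finished, so failed
# requests never reach the leaderboard. "status"/"FINISHED"/"FAILED"
# are assumed schema details.

def finished_requests(requests: list) -> list:
    return [r for r in requests if r.get("status") == "FINISHED"]

reqs = [
    {"model": "A", "status": "FINISHED"},
    {"model": "B", "status": "FAILED"},
]
print([r["model"] for r in finished_requests(reqs)])  # only "A" survives
```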
Regarding the presence or absence of a chat template and its effect on scores (using the new UI, with 14B models, for readability):
Could a normalized secondary score, relative to similar models with the same size/settings/class, be used (beyond rank) to help compare similar models?
There's also the question of chat-template settings affecting the scores: what would be a good way to mitigate this, so that models are easier to compare with each other in detail?
e.g. a way to re-submit and view scores with the opposite chat setting (especially for selected well-performing models),
and perhaps more advanced graphics/filtering interfaces?
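One simple form such a normalized score could take is a z-score of each model's metric against its peers in the same parameter-size bucket; this is only an illustrative sketch of the idea, not a proposal for the leaderboard's actual methodology, and the model names and scores below are made up:

```python
# Sketch: a size-class-relative score. Each model's metric is expressed
# in standard deviations from the mean of its peer group (same size
# bucket), making models within a class directly comparable.
from statistics import mean, pstdev

def class_normalized(scores: dict) -> dict:
    mu, sigma = mean(scores.values()), pstdev(scores.values())
    if sigma == 0:
        # All peers tied: every model sits exactly at the class mean.
        return {name: 0.0 for name in scores}
    return {name: (s - mu) / sigma for name, s in scores.items()}

peers_14b = {"model-x": 62.0, "model-y": 58.0, "model-z": 60.0}
normalized = class_normalized(peers_14b)
```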
See also: other recent discussions about the chat template affecting scores.