Feature Request: change the request file format to disambiguate chat and non-chat models?
for example, instead of:
ModelName-SizeB_eval_request_False_bfloat16_Original.json
perhaps:
ModelName-SizeB_eval_request_False_bfloat16_ChatOn_Original.json
ModelName-SizeB_eval_request_False_bfloat16_ChatOff_Original.json
- so that they don't overwrite each other on reruns
- and so that chat-on and chat-off entries are listed separately on the leaderboard, since the scores (especially for some models) differ significantly depending on this setting
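A minimal sketch of what such a naming scheme could look like; the field order mirrors the current filenames, but the function and flag names are purely illustrative, not the leaderboard's actual code:

```python
# Hypothetical request-file naming scheme that encodes the chat-template
# toggle, so ChatOn/ChatOff runs map to distinct files and never overwrite
# each other. All names here are illustrative assumptions.

def request_filename(model: str, precision: str, use_chat_template: bool,
                     weight_type: str = "Original") -> str:
    chat_flag = "ChatOn" if use_chat_template else "ChatOff"
    return f"{model}_eval_request_False_{precision}_{chat_flag}_{weight_type}.json"

# The two configurations now produce two different filenames:
print(request_filename("ModelName-SizeB", "bfloat16", True))
print(request_filename("ModelName-SizeB", "bfloat16", False))
```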
From what I'm seeing with some similar models, the chat template appears to affect IFEval scores, and MUSR too (but by how much?).
If this is to be updated, it may help to look into the request files' commit history, as well as the multiple result files (which don't overwrite each other), to disambiguate and sort things out.
The chat template's effect on scores seems to be more significant than the difference between bfloat16 and float16.
Question: what determines which chat template is used? Which file or process (e.g. generation_config.json)? What else, or what other assumptions/defaults apply?
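For reference, in the transformers library the chat template is typically read from the `chat_template` key of the repo's tokenizer_config.json (not generation_config.json), and the tokenizer applies it via `apply_chat_template`. A minimal sketch of that lookup; the helper below is illustrative, and fallback behavior varies by transformers version:

```python
# Sketch: where a chat template usually comes from. The `chat_template`
# field of tokenizer_config.json holds a Jinja template string; if it is
# absent, the library falls back to a default or raises, depending on
# the transformers version. Illustrative only.
import json
from typing import Optional

def resolve_chat_template(tokenizer_config: dict) -> Optional[str]:
    # Return the repo-provided template if one is shipped, else None.
    return tokenizer_config.get("chat_template")

cfg = json.loads('{"chat_template": "{% for m in messages %}...{% endfor %}"}')
print(resolve_chat_template(cfg))
print(resolve_chat_template({}))  # no template shipped -> None
```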
Hi @CombinHorizon ,
Thank you for your suggestion!
We agree that this modification would help compare a model with and without the chat template. We're actually in the process of revamping our request naming system, as some current parameters are no longer relevant.
We'll come back to you as soon as we have decided on a new simpler format!
Question: does the leaderboard's Chat Template column use info from the request file or the result file?
If it gets its data from the request file, consider the edge case where a model:
- is first submitted and completes successfully with one chat setting,
- is then updated,
- then a request is submitted with the opposite chat toggle setting,
- and lastly, that request fails.
In that case, wouldn't the column display the wrong info (since the request file is overwritten)? It's not a major issue, but something to keep in mind; that's why referencing the commit history matters for accurate accounting.
Hi @CombinHorizon ,
Yes, we use the info from request files to populate the Chat Template column, and there is no such problem at present: the parser is restarted every few hours and checks the request files anew each time. Plus, we don't parse failed models.
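The behavior described above can be sketched as a simple filter over the request files on each parser pass; the `status` field name and its values are assumptions about the request JSON schema, not confirmed internals:

```python
# Hedged sketch of the described parser behavior: on each pass, re-read
# every request file and keep only those marked finished, so failed
# requests never reach the leaderboard. "status"/"FINISHED"/"FAILED"
# are assumed schema details.

def finished_requests(requests: list) -> list:
    return [r for r in requests if r.get("status") == "FINISHED"]

reqs = [
    {"model": "A", "status": "FINISHED"},
    {"model": "B", "status": "FAILED"},
]
print([r["model"] for r in finished_requests(reqs)])  # only "A" survives
```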
Regarding the presence or absence of a chat template and its effect on scores (using the new UI, with 14B models, for readability):
Could a normalized secondary score, relative to similar models with the same size/settings/class, be used (beyond rank) to help compare similar models?
There's also the question of chat-template settings affecting the scores: what would be a good way to mitigate this, so that models are easier to compare with each other in detail?
e.g. a way to re-submit and view scores with the opposite chat setting (especially for selected well-performing models),
and perhaps more advanced graphics/filtering interfaces?
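One simple form such a normalized score could take is a z-score of each model's metric against its peers in the same parameter-size bucket; this is only an illustrative sketch of the idea, not a proposal for the leaderboard's actual methodology, and the model names and scores below are made up:

```python
# Sketch: a size-class-relative score. Each model's metric is expressed
# in standard deviations from the mean of its peer group (same size
# bucket), making models within a class directly comparable.
from statistics import mean, pstdev

def class_normalized(scores: dict) -> dict:
    mu, sigma = mean(scores.values()), pstdev(scores.values())
    if sigma == 0:
        # All peers tied: every model sits exactly at the class mean.
        return {name: 0.0 for name in scores}
    return {name: (s - mu) / sigma for name, s in scores.items()}

peers_14b = {"model-x": 62.0, "model-y": 58.0, "model-z": 60.0}
normalized = class_normalized(peers_14b)
```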
See also: other recent discussions about the chat template affecting scores.