Model evaluation and submission stuck on LB

#17
by abideen - opened

Hi, the evaluation queue of the Leaderboard has been stuck for a few days. Can you guys check it out and get it back up? Thank you.

it has been stuck since 2024-05-31
(~35 days as of 2024-07-05)

previously it would run pretty quickly; now progress is frozen (the same numbers of finished/pending models for days)

Question: for a given model size, how much longer does float32 precision take to run, and how much more in resources (VRAM or compute) does it need, compared to float16 or bfloat16?
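For rough intuition only (a back-of-the-envelope sketch, not the leaderboard's actual harness): weights alone take about 4 bytes per parameter in float32 versus 2 in float16/bfloat16, so roughly twice the VRAM before activations and KV cache are counted, and usually lower throughput too:

```python
# Back-of-the-envelope weight memory by dtype (weights only; ignores
# activations, KV cache, and framework overhead).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_gib(params_billion: float, dtype: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 2**30

for dt in ("float32", "float16", "bfloat16"):
    print(f"9B params in {dt}: ~{weight_gib(9, dt):.0f} GiB")
# float32 -> ~34 GiB, float16/bfloat16 -> ~17 GiB: roughly 2x the VRAM
```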

could it be that too many float32 models are running at the same time, and that is why it is frozen like this?
are there any logs about the current progress of the running models, e.g. which task/sub-test each one is on, whether progress is moving forward, and any indication of ETA?

by stuck, we mean the leaderboard is stuck at a count of only 231 finished models, with no new ones being added to the results
(see the discussions linked below for the timeline)

see
https://huggingface.co./datasets/openlifescienceai/requests/discussions/7
https://huggingface.co./datasets/openlifescienceai/requests/discussions/6
https://huggingface.co./datasets/openlifescienceai/requests/discussions/5
https://huggingface.co./datasets/openlifescienceai/requests/discussions/4
https://huggingface.co./datasets/openlifescienceai/requests/discussions/3
https://huggingface.co./datasets/openlifescienceai/requests/discussions/2

maybe delete these running float32 models? it would probably unclog the leaderboard (see the status-check sketch after this list) ...
https://huggingface.co./datasets/openlifescienceai/requests/blob/main/cognitivecomputations/dolphin-2.9.1-yi-1.5-9b_eval_request_False_float32_Original.json
https://huggingface.co./datasets/openlifescienceai/requests/blob/main/wenbopan/Faro-Yi-9B-DPO_eval_request_False_float32_Original.json
https://huggingface.co./datasets/openlifescienceai/requests/blob/main/wenbopan/Faro-Yi-9B_eval_request_False_float32_Original.json
https://huggingface.co./datasets/openlifescienceai/requests/blob/main/01-ai/Yi-1.5-9B-Chat-16K_eval_request_False_float32_Original.json
https://huggingface.co./datasets/openlifescienceai/requests/blob/main/01-ai/Yi-1.5-9B-Chat_eval_request_False_float32_Original.json
https://huggingface.co./datasets/openlifescienceai/requests/blob/main/vicgalle/Configurable-Yi-1.5-9B-Chat_eval_request_False_float32_Original.json
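For anyone who wants to check one of these without clicking through, a small sketch that downloads a request file and prints its fields; the `model`, `precision`, and `status` keys are an assumption based on the usual leaderboard request-file schema:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch one request file from the queue repo and inspect it
# (field names assume the usual leaderboard request-file schema).
path = hf_hub_download(
    repo_id="openlifescienceai/requests",
    repo_type="dataset",
    filename="wenbopan/Faro-Yi-9B_eval_request_False_float32_Original.json",
)
with open(path) as f:
    req = json.load(f)
print(req.get("model"), req.get("precision"), req.get("status"))
```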

@aaditya @aryopg , any updates?

i get how float32 is cool, if it were feasible, but is it worth it? on the HuggingFace leaderboard the difference between float32 and float16/bfloat16 scores is often only a few tenths of a percentage point, something to keep in mind.
Could the number of concurrently running float32 models be limited or de-prioritized, without restarting their progress (rerunning everything all over again), to prevent clogging?
Could there be info/logging about progress status (which sub-question/sub-test task a run is on), to give an ETA and show whether it is moving at all, to help gauge if it is worth letting it continue?
Over this period of time, aren't newer and better models coming out? what is a good way to weigh this?
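Purely as an illustration of the triage idea (a hypothetical scheduler sketch, not the leaderboard's actual backend): a concurrency cap on float32 jobs would let fp16/bf16 submissions keep flowing, without cancelling or restarting anything already queued:

```python
from collections import deque
from typing import Optional

MAX_CONCURRENT_FP32 = 1  # hypothetical cap

def pick_next(pending: deque, running: list) -> Optional[dict]:
    """Return the next job to start, skipping float32 jobs while the
    cap is reached; skipped jobs stay queued, so nothing is rerun."""
    fp32_running = sum(1 for j in running if j["precision"] == "float32")
    for job in list(pending):
        if job["precision"] == "float32" and fp32_running >= MAX_CONCURRENT_FP32:
            continue  # de-prioritize, but don't cancel
        pending.remove(job)
        return job
    return None
```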

https://huggingface.co./datasets/openlifescienceai/requests/blob/main/01-ai/Yi-1.5-9B-Chat-16K_eval_request_False_float32_Original.json
https://huggingface.co./datasets/openlifescienceai/requests/blob/main/01-ai/Yi-1.5-9B-Chat_eval_request_False_float32_Original.json

These two are closer to the core model, so it would make some sense to prioritize them, maybe ahead of the others... perhaps?

@clefourrier Is HuggingFace aware that the submission of models for evaluation on this leaderboard seems to be stuck?
The queue status is still the same as when I looked last Friday.

Reading the discussions here, it has already been an issue for quite a while.

Would it be possible to take a look at these issues?

Open Life Science AI org

Hi @robinsmits , sorry for the inconvenience. We’re currently upgrading the GPUs in the backend, along with making several other improvements. The delays in the queue are mostly due to GPU allocation and processing speed. We appreciate your patience as we work through these issues.

@aaditya Ok thanks for clarifying :-)

If it's a matter of computational resources, I can work with you to get the evaluations run. I have access to both Biowulf and another internal HPC with GPU nodes.

Thanks for the attention to this, but has there been any progress on this lately?

the leaderboard still seems to be clogged

@aaditya When can we expect this Leaderboard to be operational again?

Some models that were already reported on July 24th are still in the queue.

I can't possibly imagine that it would take more than 2 months to add a few GPUs? If I'm wrong, then wouldn't it be an idea to ask HuggingFace for assistance?

additionally, the model SrikanthChellappa/Collaiborator-MEDLLM-Llama-3-8B-v2-7 in the running list has been deleted/removed from the Hub; can it be removed from the running list?

request file:
https://huggingface.co./datasets/openlifescienceai/requests/blob/main/SrikanthChellappa/Collaiborator-MEDLLM-Llama-3-8B-v2-7_eval_request_False_bfloat16_Original.json

@aaditya @aryopg @clefourrier

also, the leaderboard has been stuck for months. is float32 worth it?
or maybe some triage/management should be added, so that there is a limit on the number of float32 models that may run at the same time?
i've seen float32 runs (for mid-size or larger models, on another leaderboard) freeze the processing of non-float32 models too, when they run at the same time.
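If it would help with triage, a quick census of the queue by precision and status is easy to script; again, the `precision` and `status` keys are assumed from the usual request-file schema:

```python
import json
from collections import Counter
from pathlib import Path
from huggingface_hub import snapshot_download

# Download the whole requests repo and tally (precision, status) pairs
# to see what is actually clogging the queue.
local = snapshot_download(repo_id="openlifescienceai/requests", repo_type="dataset")
counts = Counter()
for p in Path(local).rglob("*.json"):
    req = json.loads(p.read_text())
    counts[(req.get("precision"), req.get("status"))] += 1
for key, n in counts.most_common():
    print(key, n)
```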
