Repeated failures of various running models
As per https://huggingface.co./datasets/open-cn-llm-leaderboard/requests/commits/main, cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b has been failing repeatedly.
Does something in the back-end maybe need to be updated to better support Nemo models?
Examples where it works:
- https://huggingface.co./datasets/open-llm-leaderboard/results/blob/main/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b/results_2024-07-30T05-46-43.237089.json
- https://huggingface.co./datasets/open-llm-leaderboard/results/blob/main/VAGOsolutions/SauerkrautLM-Nemo-12b-Instruct/results_2024-07-25T10-48-03.221253.json
- https://huggingface.co./datasets/open-llm-leaderboard/requests/blob/main/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b_eval_request_False_bfloat16_Original.json
- https://huggingface.co./datasets/open-llm-leaderboard/requests/blob/main/VAGOsolutions/SauerkrautLM-Nemo-12b-Instruct_eval_request_False_bfloat16_Original.json
- https://huggingface.co./datasets/OALL/requests/blob/main/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b_eval_request_False_bfloat16_Original.json
- https://huggingface.co./datasets/OALL/requests/blob/main/VAGOsolutions/SauerkrautLM-Nemo-12b-Instruct_eval_request_False_bfloat16_Original.json
Could it be because there isn't enough reserve free RAM or capacity, so that fluctuations in RAM usage while a model runs cause some of the models to hit OOM errors?
If so, it may not be a specific model's fault, but rather a problem with how the models are queued (maybe too many running at the same time?).
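
To make the queueing idea concrete, here is a rough sketch of what I mean by keeping some RAM in reserve before starting more jobs. I don't know how the leaderboard back-end actually schedules evaluations, so everything here is an assumption: the function name `admit_jobs`, the per-job peak-memory estimates, and the headroom figure are all made up for illustration.

```python
from typing import List, Tuple


def admit_jobs(
    pending: List[Tuple[str, float]],
    free_gb: float,
    headroom_gb: float,
) -> List[str]:
    """Pick which pending jobs to start without eating into the RAM reserve.

    pending     -- (job name, estimated peak RAM in GB), in queue order (hypothetical data)
    free_gb     -- RAM currently free on the worker
    headroom_gb -- RAM deliberately left unused to absorb usage spikes
    """
    budget = free_gb - headroom_gb
    admitted: List[str] = []
    for name, est_peak_gb in pending:
        # Greedy first-fit: a job that does not fit is skipped for now,
        # while smaller jobs later in the queue may still be started.
        if est_peak_gb <= budget:
            admitted.append(name)
            budget -= est_peak_gb
    return admitted


# Hypothetical numbers: 80 GB free, keep 16 GB in reserve.
queue = [("model-a", 30.0), ("model-b", 40.0), ("model-c", 10.0)]
print(admit_jobs(queue, free_gb=80.0, headroom_gb=16.0))
```

With these made-up numbers, `model-b` would wait for the next round instead of being started and possibly pushing the machine into OOM.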
Edit: a question. When a model fails and is then restarted with the same settings (same commit, same parameters), does it have to redo all the tasks and tests, or is its progress remembered so that it continues where it left off? If not, would that be easy to implement? It seems like it would save some resources. (Do take into account that different commits of the same model aren't necessarily the same, so it's probably not a good idea to treat them as the same; better to keep their progress separate.)
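
Here is a minimal sketch of the kind of resume logic I have in mind, in case it helps. It is not how the leaderboard works today (I don't know that); `run_key`, `evaluate_with_resume`, the `run_task` callback, and the JSON checkpoint layout are all hypothetical names I made up. The point is only that progress is keyed by model, commit, and parameters together, so a different commit never reuses another commit's results.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_key(model: str, revision: str, params: Dict[str, str]) -> str:
    """Stable ID for one (model, commit, params) combination."""
    payload = json.dumps(
        {"model": model, "revision": revision, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def evaluate_with_resume(
    model: str,
    revision: str,
    params: Dict[str, str],
    tasks: List[str],
    run_task: Callable[[str], float],
    cache_dir: str,
) -> Dict[str, float]:
    """Run each task once; after a crash, skip tasks already checkpointed."""
    ckpt = Path(cache_dir) / f"{run_key(model, revision, params)}.json"
    done: Dict[str, float] = json.loads(ckpt.read_text()) if ckpt.exists() else {}
    for task in tasks:
        if task in done:
            continue  # finished in an earlier attempt with identical settings
        done[task] = run_task(task)  # may raise, e.g. on OOM
        # Save after every task so a later failure loses at most one task.
        ckpt.parent.mkdir(parents=True, exist_ok=True)
        ckpt.write_text(json.dumps(done))
    return done
```

If the process dies partway through, calling `evaluate_with_resume` again with the same model, commit, and parameters picks up at the first unfinished task; changing the commit produces a different key and therefore a fresh run.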