How the scores are calculated
This may be a lame question, but I'm wondering how the single score for a task group is calculated.
Are those just averages of the task scores within the group?
Like:
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard_musr | N/A | | | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5280 | ± | 0.0316 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.3008 | ± | 0.0287 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.2200 | ± | 0.0263 |
So the MUSR score should be 0.3496?
Hi @csabakecskemeti ,
Not a lame question at all! Let me explain how the task group score (like MUSR) is calculated with a small code example.
The score for a task group is the average of the normalized scores of all tasks within that group. For your example:

- murder_mysteries = 0.5280
- object_placements = 0.3008
- team_allocation = 0.2200

The MUSR score is calculated as:
```python
scores = [0.5280, 0.3008, 0.2200]
musr_score = sum(scores) / len(scores)
print(f"MUSR Score: {musr_score:.4f}")
# Output: MUSR Score: 0.3496
```
In our parser, this simplified logic might look like this:
```python
import numpy as np

# Example scores for tasks within the group
task_scores = {
    "leaderboard_musr_murder_mysteries": 0.5280,
    "leaderboard_musr_object_placements": 0.3008,
    "leaderboard_musr_team_allocation": 0.2200,
}

# Compute the average score for the group
task_group_name = "MUSR"
task_group_score = np.mean(list(task_scores.values()))
print(f"{task_group_name} Score: {task_group_score:.4f}")
# Output: MUSR Score: 0.3496
```
We normalize the scores to ensure consistency across tasks, especially when different tasks use different metrics or scales. This guarantees a fair representation in the group score. You will find more info in our documentation.
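To give an idea of what such a normalization can look like: a common approach for multiple-choice tasks is to subtract the random-guess baseline from the raw accuracy and rescale, so that random guessing maps to 0 and a perfect score maps to 1. The function and baseline value below are a minimal illustrative sketch of that idea, not the leaderboard's exact code:

```python
def normalize_score(raw_score: float, random_baseline: float) -> float:
    """Rescale a raw accuracy so that random guessing maps to 0.0
    and a perfect score maps to 1.0 (scores below chance clip to 0)."""
    normalized = (raw_score - random_baseline) / (1.0 - random_baseline)
    return max(0.0, normalized)

# Hypothetical example: a 4-choice task has a random baseline of 0.25
print(f"{normalize_score(0.5280, 0.25):.4f}")  # Output: 0.3707
```

With every task rescaled onto the same 0-to-1 range, averaging them into a group score no longer over-weights tasks whose chance level is higher.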
I hope this clears things up. Feel free to ask if you have more questions!
Thanks
I'm closing this discussion, feel free to open a new one in case of any questions!