How the scores are calculated

#1028
by csabakecskemeti - opened

This may be a lame question, but I'm wondering how the single score for a task group is calculated.
Is it just the average of the task scores within the group?

Like:
|leaderboard_musr | N/A| | | | | | | |
| - leaderboard_musr_murder_mysteries | 1|none | 0|acc_norm |↑ |0.5280|± |0.0316|
| - leaderboard_musr_object_placements | 1|none | 0|acc_norm |↑ |0.3008|± |0.0287|
| - leaderboard_musr_team_allocation | 1|none | 0|acc_norm |↑ |0.2200|± |0.0263|

So MUSR should be 0.3496?

Open LLM Leaderboard org

Hi @csabakecskemeti ,

Not a lame question at all! Let me explain how the task group score (like MUSR) is calculated with a small code example.

The score for a task group is the average of the normalized scores for all tasks within that group. For your example, the raw acc_norm values are:

murder_mysteries = 0.5280
object_placements = 0.3008
team_allocation = 0.2200

Taking the plain average of these raw values gives the number you computed:

scores = [0.5280, 0.3008, 0.2200]
musr_score = sum(scores) / len(scores)
print(f"MUSR Score: {musr_score:.4f}")
# Output: MUSR Score: 0.3496

In our parser, this simplified logic might look like this:

import numpy as np

# Example scores for tasks within the group
task_scores = {
    "leaderboard_musr_murder_mysteries": 0.5280,
    "leaderboard_musr_object_placements": 0.3008,
    "leaderboard_musr_team_allocation": 0.2200,
}

# Compute the average score for the group
task_group_name = "MUSR"
task_group_score = np.mean(list(task_scores.values()))

print(f"{task_group_name} Score: {task_group_score:.4f}")
# Output: MUSR Score: 0.3496

We normalize the scores to ensure consistency across tasks, especially when different tasks have varying metrics or scales. This keeps the group score a fair representation of its tasks. You will find more info in our documentation.
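
The snippets above average the raw values, so here is a rough sketch of what the normalization step could look like before averaging. It assumes each raw accuracy is rescaled so that the random-guess baseline maps to 0 and a perfect score maps to 100, with anything below the baseline set to 0. The `normalize` helper and the per-subtask choice counts (2, 5, and 3) are assumptions made for illustration, so please check them against the documentation rather than treating this as the exact parser code.

```python
def normalize(raw: float, baseline: float) -> float:
    """Rescale a raw accuracy so the baseline maps to 0 and 1.0 maps to 100.

    Hypothetical helper: scores below the baseline are set to 0.
    """
    if raw < baseline:
        return 0.0
    return (raw - baseline) / (1.0 - baseline) * 100.0


# Raw acc_norm values from the table in the question.
raw_scores = {
    "leaderboard_musr_murder_mysteries": 0.5280,
    "leaderboard_musr_object_placements": 0.3008,
    "leaderboard_musr_team_allocation": 0.2200,
}

# ASSUMPTION: number of answer choices per subtask, used to derive
# the random-guess baseline as 1 / n_choices.
num_choices = {
    "leaderboard_musr_murder_mysteries": 2,
    "leaderboard_musr_object_placements": 5,
    "leaderboard_musr_team_allocation": 3,
}

# Normalize each subtask first, then average the normalized values.
normalized = {
    name: normalize(score, 1.0 / num_choices[name])
    for name, score in raw_scores.items()
}
musr_score = sum(normalized.values()) / len(normalized)

for name, value in normalized.items():
    print(f"{name}: {value:.2f}")
print(f"MUSR normalized score: {musr_score:.2f}")
```

Under these assumptions the group score ends up well below the raw average, because a subtask that scores under its random baseline (here team_allocation) contributes 0 instead of its raw accuracy.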

I hope this clears things up. Feel free to ask if you have more questions!

Open LLM Leaderboard org

I'm closing this discussion. Feel free to open a new one if you have any questions!

alozowski changed discussion status to closed
alozowski changed discussion title from How the cores calculated to How the scores are calculated
