open-llm-leaderboard/open_llm_leaderboard · How the scores are calculated

25 days ago

This may be a lame question, but I'm wondering how the single score for a task group is calculated.
Is it just the average of the task scores within the group?

Like:
|leaderboard_musr | N/A| | | | | | | |
| - leaderboard_musr_murder_mysteries | 1|none | 0|acc_norm |↑ |0.5280|± |0.0316|
| - leaderboard_musr_object_placements | 1|none | 0|acc_norm |↑ |0.3008|± |0.0287|
| - leaderboard_musr_team_allocation | 1|none | 0|acc_norm |↑ |0.2200|± |0.0263|

So MUSR should be 0.3496?

alozowski

Open LLM Leaderboard org 23 days ago

Hi @csabakecskemeti ,

Not a lame question at all! Let me explain how the task group score (like MUSR) is calculated with a small code example.

The score for a task group is the average of the normalized scores for all tasks within that group. For your example:

murder_mysteries = 0.5280
object_placements = 0.3008
team_allocation = 0.2200

The MUSR score is calculated as:

scores = [0.5280, 0.3008, 0.2200]
musr_score = sum(scores) / len(scores)
print(f"MUSR Score: {musr_score:.4f}")
# Output: MUSR Score: 0.3496

In our parser, this simplified logic might look like this:

import numpy as np

# Example scores for tasks within the group
task_scores = {
    "leaderboard_musr_murder_mysteries": 0.5280,
    "leaderboard_musr_object_placements": 0.3008,
    "leaderboard_musr_team_allocation": 0.2200,
}

# Compute the average score for the group
task_group_name = "MUSR"
task_group_score = np.mean(list(task_scores.values()))

print(f"{task_group_name} Score: {task_group_score:.4f}")
# Output: MUSR Score: 0.3496

We normalize the scores so that tasks with different metrics or scales are comparable, which keeps the group score a fair representation of every task in it. You will find more info in our documentation.
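To make the normalization step concrete, here is a minimal sketch. It assumes each raw score is rescaled between a random-guess baseline and 1.0, with anything at or below the baseline clipped to 0, and then expressed on a 0-100 scale. The baselines in `ASSUMED_BASELINES` (one over an assumed number of answer choices per task) and the `normalize` helper are illustrative assumptions, not values or code taken from this thread, so please check them against the documentation before relying on the result.

```python
# Raw acc_norm values from the table in the question.
raw_scores = {
    "leaderboard_musr_murder_mysteries": 0.5280,
    "leaderboard_musr_object_placements": 0.3008,
    "leaderboard_musr_team_allocation": 0.2200,
}

# ASSUMPTION: random-guess baselines, taken as 1 / (number of answer choices).
# The choice counts used here (2, 5, 3) are illustrative, not from this thread.
ASSUMED_BASELINES = {
    "leaderboard_musr_murder_mysteries": 1 / 2,
    "leaderboard_musr_object_placements": 1 / 5,
    "leaderboard_musr_team_allocation": 1 / 3,
}


def normalize(raw, lower_bound, upper_bound=1.0):
    """Hypothetical helper: rescale raw to 0-100 between the two bounds."""
    if raw <= lower_bound:
        # Scores at or below the random baseline are clipped to zero.
        return 0.0
    return (raw - lower_bound) / (upper_bound - lower_bound) * 100


# Normalize every task, then average the normalized values for the group.
normalized = {
    task: normalize(score, ASSUMED_BASELINES[task])
    for task, score in raw_scores.items()
}
group_score = sum(normalized.values()) / len(normalized)

for task, score in normalized.items():
    print(f"{task}: {score:.2f}")
print(f"MUSR (normalized, under the assumed baselines): {group_score:.2f}")
```

Under these assumptions, a task scoring below its random baseline contributes 0 to the group, so the normalized group score can differ noticeably from the plain average of the raw values.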

I hope this clears things up. Feel free to ask if you have more questions!

csabakecskemeti

17 days ago

Thanks

alozowski

Open LLM Leaderboard org 17 days ago

I'm closing this discussion. Feel free to open a new one if you have any more questions!

alozowski changed discussion status to closed 17 days ago

alozowski changed discussion title from How the cores calculated to How the scores are calculated 17 days ago