benhaotang committed on
Commit 40a4b69 · verified · 1 Parent(s): de04971

Adding Evaluation Results


This is an automated PR created with [this space](https://huggingface.co/spaces/T145/open-llm-leaderboard-results-to-modelcard)!

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

Please report any issues here: https://huggingface.co/spaces/T145/open-llm-leaderboard-results-to-modelcard/discussions
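
For readers unfamiliar with the `model-index` metadata this PR adds, the following is a minimal sketch of how one result entry is shaped, using two of the values from the README.md diff. The helper `build_result` and the constant `LEADERBOARD_URL` are hypothetical names introduced for illustration; this is not the code the automation space actually runs.

```python
import json

# Source link copied from the diff; the constant name is an assumption.
LEADERBOARD_URL = (
    "https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard"
    "#/?search=benhaotang%2Fphi4-qwq-sky-t1"
)


def build_result(dataset_name, dataset_type, num_few_shot,
                 metric_type, metric_name, value, split=None):
    """Hypothetical helper: assemble one model-index result entry."""
    dataset = {"name": dataset_name, "type": dataset_type}
    if split is not None:
        # Some entries in the diff (e.g. MuSR) carry no split field.
        dataset["split"] = split
    dataset["args"] = {"num_few_shot": num_few_shot}
    return {
        "task": {"type": "text-generation", "name": "Text Generation"},
        "dataset": dataset,
        "metrics": [{"type": metric_type, "value": value, "name": metric_name}],
        "source": {"url": LEADERBOARD_URL, "name": "Open LLM Leaderboard"},
    }


model_index = [{
    "name": "phi4-qwq-sky-t1",
    "results": [
        build_result("BBH (3-Shot)", "SaylorTwift/bbh", 3,
                     "acc_norm", "normalized accuracy", 52.61, split="test"),
        build_result("MuSR (0-shot)", "TAUR-Lab/MuSR", 0,
                     "acc_norm", "acc_norm", 21.38),
    ],
}]

# JSON is used here only to inspect the nesting; the README stores it as YAML.
print(json.dumps(model_index, indent=2))
```

Each entry pairs one task, one dataset, a list of metrics, and a source link, which is why the diff repeats the same four keys six times.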

Files changed (1)
  1. README.md +114 -1
README.md CHANGED
@@ -10,6 +10,105 @@ tags:
 datasets:
 - NovaSky-AI/Sky-T1_data_17k
 license: mit
+model-index:
+- name: phi4-qwq-sky-t1
+  results:
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: IFEval (0-Shot)
+      type: wis-k/instruction-following-eval
+      split: train
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: inst_level_strict_acc and prompt_level_strict_acc
+      value: 4.6
+      name: averaged accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=benhaotang%2Fphi4-qwq-sky-t1
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: BBH (3-Shot)
+      type: SaylorTwift/bbh
+      split: test
+      args:
+        num_few_shot: 3
+    metrics:
+    - type: acc_norm
+      value: 52.61
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=benhaotang%2Fphi4-qwq-sky-t1
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MATH Lvl 5 (4-Shot)
+      type: lighteval/MATH-Hard
+      split: test
+      args:
+        num_few_shot: 4
+    metrics:
+    - type: exact_match
+      value: 39.58
+      name: exact match
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=benhaotang%2Fphi4-qwq-sky-t1
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GPQA (0-shot)
+      type: Idavidrein/gpqa
+      split: train
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: acc_norm
+      value: 19.35
+      name: acc_norm
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=benhaotang%2Fphi4-qwq-sky-t1
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MuSR (0-shot)
+      type: TAUR-Lab/MuSR
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: acc_norm
+      value: 21.38
+      name: acc_norm
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=benhaotang%2Fphi4-qwq-sky-t1
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MMLU-PRO (5-shot)
+      type: TIGER-Lab/MMLU-Pro
+      config: main
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 47.16
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=benhaotang%2Fphi4-qwq-sky-t1
+      name: Open LLM Leaderboard
 ---
 # merge
 
@@ -78,4 +177,18 @@ parameters:
   normalize: false
   int8_mask: true
 dtype: float16
-```
+```
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/benhaotang__phi4-qwq-sky-t1-details)!
+Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=benhaotang%2Fphi4-qwq-sky-t1&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!
+
+|      Metric       |Value (%)|
+|-------------------|--------:|
+|**Average**        |    30.78|
+|IFEval (0-Shot)    |     4.60|
+|BBH (3-Shot)       |    52.61|
+|MATH Lvl 5 (4-Shot)|    39.58|
+|GPQA (0-shot)      |    19.35|
+|MuSR (0-shot)      |    21.38|
+|MMLU-PRO (5-shot)  |    47.16|
+
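
The **Average** row in the table above appears to be the unweighted mean of the six benchmark scores. A minimal sketch to check that arithmetic, assuming equal weighting and rounding to two decimals (the function name `leaderboard_average` is hypothetical):

```python
# Scores copied from the results table in the diff (percent).
scores = {
    "IFEval (0-Shot)": 4.60,
    "BBH (3-Shot)": 52.61,
    "MATH Lvl 5 (4-Shot)": 39.58,
    "GPQA (0-shot)": 19.35,
    "MuSR (0-shot)": 21.38,
    "MMLU-PRO (5-shot)": 47.16,
}


def leaderboard_average(values):
    """Unweighted arithmetic mean, rounded to two decimals (assumed convention)."""
    values = list(values)
    return round(sum(values) / len(values), 2)


average = leaderboard_average(scores.values())
print(f"Average: {average:.2f}")  # prints "Average: 30.78"
```

The six scores sum to 184.68, and 184.68 / 6 = 30.78, which matches the reported value.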