Tags: Text Generation · Transformers · Safetensors · Finnish · llama · finnish · conversational · text-generation-inference
aapot committed · verified · commit 0b51e96 · 1 parent: 137bf09

Update README.md

Files changed (1): README.md (+50 -50)
README.md CHANGED
@@ -203,40 +203,40 @@ This Ahma 3B base model was primarily evaluated using [FIN-bench by TurkuNLP](ht
 
  | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
  |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 48.46 | 56.92 | TBA | 49.23 | 40.00 | 54.62 |
- | Arithmetic | 27.64 | 22.14 | 11.50 | TBA | 33.15 | 30.16 | 30.34 |
- | Cause and Effect | 59.48 | 58.82 | 59.48 | TBA | 66.01 | 58.82 | 62.74 |
- | Emotions | 36.25 | 28.12 | 36.25 | TBA | 22.50 | 26.25 | 35.63 |
- | Empirical Judgements | 33.33 | 35.35 | 33.33 | TBA | 27.27 | 33.33 | 49.49 |
- | General Knowledge | 44.29 | 48.57 | 51.43 | TBA | 40.00 | 24.29 | 51.43 |
- | HHH Alignment | 42.09 | 41.66 | 44.23 | TBA | 41.81 | 42.51 | 42.92 |
- | Intent Recognition | 24.42 | 26.16 | 43.64 | TBA | 17.49 | 22.40 | 68.35 |
- | Misconceptions | 46.27 | 47.01 | 46.27 | TBA | 53.73 | 53.73 | 52.24 |
- | Paraphrase | 59.50 | 73.00 | 67.00 | TBA | 51.00 | 50.00 | 51.00 |
- | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | TBA | 51.67 | 48.33 | 50.00 |
- | Similarities Abstraction | 65.79 | 68.42 | 71.05 | TBA | 60.53 | 65.79 | 60.53 |
- | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | TBA | **46.17** | **44.42** | **52.08** |
- | **Overall Average** | **36.49** | **34.06** | **29.20** | TBA | **38.93** | **36.50** | **40.00** |
+ | Analogies | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
+ | Arithmetic | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
+ | Cause and Effect | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
+ | Emotions | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
+ | Empirical Judgements | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
+ | General Knowledge | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
+ | HHH Alignment | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
+ | Intent Recognition | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
+ | Misconceptions | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
+ | Paraphrase | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
+ | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
+ | Similarities Abstraction | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
+ | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
+ | **Overall Average** | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |
 
 
  3-shot results:
 
  | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
  |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 49.23 | 49.23 | TBA | 40.77 | 54.62 | 76.92 |
- | Arithmetic | 38.38 | 43.89 | 20.88 | TBA | 43.63 | 45.78 | 53.68 |
- | Cause and Effect | 60.78 | 64.71 | 66.01 | TBA | 64.05 | 58.17 | 67.32 |
- | Emotions | 30.00 | 41.25 | 30.00 | TBA | 44.37 | 48.13 | 56.87 |
- | Empirical Judgements | 46.46 | 44.44 | 39.39 | TBA | 32.32 | 43.43 | 63.64 |
- | General Knowledge | 47.14 | 40.00 | 27.14 | TBA | 54.29 | 28.57 | 74.29 |
- | HHH Alignment | 43.53 | 44.80 | 43.80 | TBA | 45.39 | 44.80 | 46.07 |
- | Intent Recognition | 20.52 | 44.22 | 36.42 | TBA | 51.45 | 58.82 | 83.67 |
- | Misconceptions | 50.75 | 52.24 | 46.27 | TBA | 52.99 | 46.27 | 52.99 |
- | Paraphrase | 50.50 | 58.50 | 57.50 | TBA | 53.00 | 54.50 | 55.00 |
- | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | TBA | 51.67 | 53.33 | 66.67 |
- | Similarities Abstraction | 69.74 | 72.37 | 72.37 | TBA | 64.47 | 73.68 | 75.00 |
- | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | TBA | **51.19** | **50.94** | **61.96** |
- | **Overall Average** | **42.87** | **47.27** | **33.41** | TBA | **46.99** | **48.07** | **57.36** |
+ | Analogies | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
+ | Arithmetic | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
+ | Cause and Effect | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
+ | Emotions | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
+ | Empirical Judgements | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
+ | General Knowledge | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
+ | HHH Alignment | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
+ | Intent Recognition | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
+ | Misconceptions | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
+ | Paraphrase | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
+ | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
+ | Similarities Abstraction | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
+ | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
+ | **Overall Average** | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |
 
 
  As we can see, the Ahma 3B base model outperforms models twice its size, such as FinGPT 8B and Viking 7B, especially on non-arithmetic tasks in 0-shot usage. Even the 10X larger Poro 34B model, which is generally better, doesn't show a huge performance difference considering its size, and Ahma 3B actually surpasses it on some tasks. This result might be attributed to Ahma's 2-stage pretraining and the inclusion of instruction-following examples during the pretraining phase.
@@ -250,31 +250,31 @@ This Ahma 3B base model was also evaluated using [MTBench Finnish by LumiOpen](h
 
  Single-turn results:
 
- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct |
- |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|
- | Coding | 1.00 | 1.00 | 1.70 | TBA |
- | Extraction | 2.00 | 1.30 | 3.10 | TBA |
- | Humanities | 4.05 | 6.20 | 6.60 | TBA |
- | Math | 3.00 | 3.20 | 3.90 | TBA |
- | Reasoning | 2.90 | 4.60 | 3.70 | TBA |
- | Roleplay | 4.80 | 6.50 | 6.60 | TBA |
- | STEM | 5.10 | 5.95 | 6.75 | TBA |
- | Writing | 6.60 | 9.00 | 7.10 | TBA |
- | **Overall Average** | **3.68** | **4.72** | **4.93** | TBA |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
+ |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
+ | Coding | 1.00 | 1.00 | 1.70 | 1.10 |
+ | Extraction | 2.00 | 1.30 | 3.10 | 3.00 |
+ | Humanities | 4.05 | 6.20 | 6.60 | 8.00 |
+ | Math | 3.00 | 3.20 | 3.90 | 2.90 |
+ | Reasoning | 2.90 | 4.60 | 3.70 | 5.70 |
+ | Roleplay | 4.80 | 6.50 | 6.60 | 7.20 |
+ | STEM | 5.10 | 5.95 | 6.75 | 7.30 |
+ | Writing | 6.60 | 9.00 | 7.10 | 8.80 |
+ | **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |
 
  Multi-turn results:
 
- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct | Poro 34B Chat |
- |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|:--------------|
- | Coding | 1.00 | 1.00 | 1.40 | TBA | 3.70 |
- | Extraction | 1.55 | 1.15 | 2.05 | TBA | 6.37 |
- | Humanities | 3.25 | 6.20 | 4.95 | TBA | 9.25 |
- | Math | 2.20 | 2.70 | 2.50 | TBA | 1.20 |
- | Reasoning | 2.45 | 3.50 | 2.55 | TBA | 4.35 |
- | Roleplay | 4.90 | 6.40 | 6.35 | TBA | 7.35 |
- | STEM | 4.20 | 4.78 | 4.28 | TBA | 7.80 |
- | Writing | 3.80 | 6.65 | 4.10 | TBA | 8.50 |
- | **Overall Average** | **2.92** | **4.05** | **3.52** | TBA | **6.06** |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
+ |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------|
+ | Coding | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
+ | Extraction | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
+ | Humanities | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
+ | Math | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
+ | Reasoning | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
+ | Roleplay | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
+ | STEM | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
+ | Writing | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
+ | **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |
 
  As we can see, the Ahma 3B base model struggles with multi-turn examples, as expected, since it was pretrained only on single-turn instruction-following examples. In addition, coding performance was expectedly poor because the Ahma 3B model was not trained on code data. Ahma 3B also tended to repeat its generated text in some evaluation examples, which affected the scoring. Adding a repetition penalty to the evaluation script's generation method already improved the scores significantly, so in real-world use the Ahma 3B model should be run with better generation settings than those used in this benchmark.
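
Editor's note: since the closing paragraph of the diff recommends better generation settings (in particular a repetition penalty) for real-world use, here is a minimal sketch of single-turn generation with the Transformers API. The repo id `Finnish-NLP/Ahma-3B`, the Finnish prompt, and all sampling values are illustrative assumptions, not settings confirmed by this commit; in real use the prompt should also be wrapped in the model's instruct prompt format, whose exact template is not shown in this diff.

```python
# Hedged sketch: repo id, prompt, and generation values are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Finnish-NLP/Ahma-3B"  # assumed repo id for the Ahma 3B base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Single-turn Finnish instruction ("Briefly explain what FIN-bench is.");
# wrap it in the model card's instruct prompt format in real use.
prompt = "Kerro lyhyesti, mikä on FIN-bench."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    # Values > 1.0 down-weight tokens already present in the sequence,
    # countering the repetition issue noted in the evaluation discussion.
    repetition_penalty=1.2,  # illustrative value; tune per use case
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The `repetition_penalty` argument is Transformers' built-in CTRL-style penalty on already-generated tokens; it is one plausible way to realize the "better generation settings" the model card recommends, alongside sampling parameters such as temperature.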