Spaetzle
Collection
German-English models, mostly merged, some sft/dpo
•
117 items
•
Updated
•
1
llama3.1-8b-spaetzle-v90 is a progressive merge of merges.
German EQ-Bench v2_de: 69.93 (171/171). English (v2): 77.88 (171/171)
Open LLM Leaderboard Evaluation Results Detailed results can be found here
Metric | Value |
---|---|
Avg. | 27.59 |
IFEval (0-Shot) | 73.56 |
BBH (3-Shot) | 32.76 |
MATH Lvl 5 (4-Shot) | 13.37 |
GPQA (0-shot) | 4.36 |
MuSR (0-shot) | 11.15 |
MMLU-PRO (5-shot) | 30.34 |
Model | AGIEval | TruthfulQA | Bigbench |
---|---|---|---|
llama3.1-8b-spaetzle-v90 | 42.05 | 57.2 | 44.75 |
Task | Version | Metric | Value | Stderr | |
---|---|---|---|---|---|
agieval_aqua_rat | 0 | acc | 24.02 | ± | 2.69 |
acc_norm | 23.62 | ± | 2.67 | ||
agieval_logiqa_en | 0 | acc | 40.09 | ± | 1.92 |
acc_norm | 39.78 | ± | 1.92 | ||
agieval_lsat_ar | 0 | acc | 22.17 | ± | 2.75 |
acc_norm | 21.74 | ± | 2.73 | ||
agieval_lsat_lr | 0 | acc | 50.39 | ± | 2.22 |
acc_norm | 45.29 | ± | 2.21 | ||
agieval_lsat_rc | 0 | acc | 64.31 | ± | 2.93 |
acc_norm | 58.36 | ± | 3.01 | ||
agieval_sat_en | 0 | acc | 81.07 | ± | 2.74 |
acc_norm | 73.79 | ± | 3.07 | ||
agieval_sat_en_without_passage | 0 | acc | 45.15 | ± | 3.48 |
acc_norm | 38.83 | ± | 3.40 | ||
agieval_sat_math | 0 | acc | 40.91 | ± | 3.32 |
acc_norm | 35.00 | ± | 3.22 |
Average: 42.05%
Task | Version | Metric | Value | Stderr | |
---|---|---|---|---|---|
truthfulqa_mc | 1 | mc1 | 39.66 | ± | 1.71 |
mc2 | 57.20 | ± | 1.51 |
Average: 57.2%
Task | Version | Metric | Value | Stderr | |
---|---|---|---|---|---|
bigbench_causal_judgement | 0 | multiple_choice_grade | 58.42 | ± | 3.59 |
bigbench_date_understanding | 0 | multiple_choice_grade | 70.46 | ± | 2.38 |
bigbench_disambiguation_qa | 0 | multiple_choice_grade | 31.40 | ± | 2.89 |
bigbench_geometric_shapes | 0 | multiple_choice_grade | 33.43 | ± | 2.49 |
exact_str_match | 0.00 | ± | 0.00 | ||
bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 30.00 | ± | 2.05 |
bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 24.29 | ± | 1.62 |
bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 56.00 | ± | 2.87 |
bigbench_movie_recommendation | 0 | multiple_choice_grade | 38.20 | ± | 2.18 |
bigbench_navigate | 0 | multiple_choice_grade | 50.20 | ± | 1.58 |
bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 69.50 | ± | 1.03 |
bigbench_ruin_names | 0 | multiple_choice_grade | 54.46 | ± | 2.36 |
bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 32.77 | ± | 1.49 |
bigbench_snarks | 0 | multiple_choice_grade | 65.19 | ± | 3.55 |
bigbench_sports_understanding | 0 | multiple_choice_grade | 50.30 | ± | 1.59 |
bigbench_temporal_sequences | 0 | multiple_choice_grade | 45.70 | ± | 1.58 |
bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.08 | ± | 1.17 |
bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.03 | ± | 0.90 |
bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 56.00 | ± | 2.87 |
Average: 44.75%
The merge tree involves the following models:
There have been a number of steps involved, among which, slep merging of only middle layers compensating for tokenizer / chat template differences. An illustration below.
The final merge for this was:
models:
- model: cstr/llama3.1-8b-spaetzle-v59
# no parameters necessary for base model
- model: cstr/llama3.1-8b-spaetzle-v85
parameters:
density: 0.65
weight: 0.3
- model: cstr/llama3.1-8b-spaetzle-v86
parameters:
density: 0.65
weight: 0.3
- model: cstr/llama3.1-8b-spaetzle-v74
parameters:
density: 0.65
weight: 0.3
merge_method: dare_ties
base_model: cstr/llama3.1-8b-spaetzle-v59
parameters:
int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
Among the previous steps:
models:
- model: NousResearch/Hermes-3-Llama-3.1-8B
merge_method: slerp
base_model: cstr/llama3.1-8b-spaetzle-v74
parameters:
t:
- value: [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
dtype: float16
Use with llama3 chat template as common. Here are GGUF quants for use with llama.cpp & wrappers as e.g. ollama: cstr/llama3.1-8b-spaetzle-v90-GGUF