metadata
tags:
  - merge
  - mergekit
  - lazymergekit
language:
  - de
  - en
base_model:
  - abideen/AlphaMonarch-dora
  - mayflowergmbh/Wiedervereinigung-7b-dpo
  - flemmingmiguel/NeuDist-Ro-7B
  - ResplendentAI/Flora_DPO_7B
  - yleo/EmertonMonarch-7B
  - occiglot/occiglot-7b-de-en-instruct
  - OpenPipe/mistral-ft-optimized-1227
  - DiscoResearch/DiscoLM_German_7b_v1
  - LeoLM/leo-mistral-hessianai-7b
  - DRXD1000/Phoenix
  - VAGOsolutions/SauerkrautLM-7b-v1-mistral
  - malteos/hermeo-7b
  - FelixChao/WestSeverus-7B-DPO-v2
  - cognitivecomputations/openchat-3.5-0106-laser
license: cc-by-nc-4.0

Spaetzle-v69-7b

This is a progressive merge (mostly DARE-TIES, with some SLERP steps) intended as a suitable compromise for local English and German tasks.

A Q4_K_M-quantized GGUF is also available.

It should work well enough with the ChatML prompt template, since all merged models should have seen ChatML prompts at least during the DPO stage.
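For readers unfamiliar with ChatML, here is a minimal sketch of what such a prompt looks like as a plain string. The helper name `to_chatml` is hypothetical, and the layout shown is the commonly used ChatML convention. Whether this repository's tokenizer ships exactly this template is an assumption, so prefer `tokenizer.apply_chat_template` when the tokenizer is available.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    Hypothetical helper for illustration; the tokenizer's own chat template
    is the authoritative source for the exact format.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Was ist ein großes Sprachmodell?"},
]
prompt = to_chatml(messages)
print(prompt)
```

The same message list works for German and English input; only the `content` strings change.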

Evaluation

Benchmark scores are not the best achievable, because the model attempts a compromise among several criteria: German language performance, instruction following, reasoning capabilities, robustness (so far, I have not encountered inserted tokens, for example), model licensing, and others. Nevertheless, they are not too bad:

Running quantized, it achieves:

  • German EQ Bench: Score (v2_de): 62.59 (Parseable: 171.0).
  • English EQ Bench: Score (v2): 76.43 (Parseable: 171.0).

Open LLM Leaderboard Evaluation Results: Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 72.87 |
| AI2 Reasoning Challenge (25-Shot) | 69.54 |
| HellaSwag (10-Shot) | 86.77 |
| MMLU (5-Shot) | 64.63 |
| TruthfulQA (0-shot) | 65.61 |
| Winogrande (5-shot) | 81.93 |
| GSM8k (5-shot) | 68.76 |

Nous benchmark results:

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Spaetzle-v69-7b | 44.48 | 75.84 | 66.15 | 46.59 | 58.27 |

AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.98 | ± 2.76 |
| | | acc_norm | 23.62 | ± 2.67 |
| agieval_logiqa_en | 0 | acc | 39.78 | ± 1.92 |
| | | acc_norm | 39.48 | ± 1.92 |
| agieval_lsat_ar | 0 | acc | 23.48 | ± 2.80 |
| | | acc_norm | 23.91 | ± 2.82 |
| agieval_lsat_lr | 0 | acc | 50.00 | ± 2.22 |
| | | acc_norm | 51.76 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 63.94 | ± 2.93 |
| | | acc_norm | 64.31 | ± 2.93 |
| agieval_sat_en | 0 | acc | 76.70 | ± 2.95 |
| | | acc_norm | 77.67 | ± 2.91 |
| agieval_sat_en_without_passage | 0 | acc | 46.12 | ± 3.48 |
| | | acc_norm | 44.17 | ± 3.47 |
| agieval_sat_math | 0 | acc | 34.09 | ± 3.20 |
| | | acc_norm | 30.91 | ± 3.12 |

Average: 44.48%

GPT4All

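| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 63.23 | ± 1.41 |
| | | acc_norm | 64.16 | ± 1.40 |
| arc_easy | 0 | acc | 85.90 | ± 0.71 |
| | | acc_norm | 82.49 | ± 0.78 |
| boolq | 1 | acc | 87.80 | ± 0.57 |
| hellaswag | 0 | acc | 67.05 | ± 0.47 |
| | | acc_norm | 85.19 | ± 0.35 |
| openbookqa | 0 | acc | 38.40 | ± 2.18 |
| | | acc_norm | 48.40 | ± 2.24 |
| piqa | 0 | acc | 82.75 | ± 0.88 |
| | | acc_norm | 84.28 | ± 0.85 |
| winogrande | 0 | acc | 78.53 | ± 1.15 |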

Average: 75.84%
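The AGIEval and GPT4All averages above can be reproduced from the per-task rows if one assumes that each task contributes its acc_norm value where one is reported and its acc value otherwise. That selection rule is inferred from the numbers, not stated by the evaluation tooling, and the helper name `suite_average` is hypothetical. A short check:

```python
def suite_average(results):
    """Mean over tasks, using acc_norm where reported and acc otherwise.

    Assumed aggregation rule, inferred from the published averages.
    """
    picked = [task.get("acc_norm", task["acc"]) for task in results.values()]
    return sum(picked) / len(picked)


# Values copied from the AGIEval table.
agieval = {
    "aqua_rat": {"acc": 25.98, "acc_norm": 23.62},
    "logiqa_en": {"acc": 39.78, "acc_norm": 39.48},
    "lsat_ar": {"acc": 23.48, "acc_norm": 23.91},
    "lsat_lr": {"acc": 50.00, "acc_norm": 51.76},
    "lsat_rc": {"acc": 63.94, "acc_norm": 64.31},
    "sat_en": {"acc": 76.70, "acc_norm": 77.67},
    "sat_en_without_passage": {"acc": 46.12, "acc_norm": 44.17},
    "sat_math": {"acc": 34.09, "acc_norm": 30.91},
}

# Values copied from the GPT4All table; boolq and winogrande report acc only.
gpt4all = {
    "arc_challenge": {"acc": 63.23, "acc_norm": 64.16},
    "arc_easy": {"acc": 85.90, "acc_norm": 82.49},
    "boolq": {"acc": 87.80},
    "hellaswag": {"acc": 67.05, "acc_norm": 85.19},
    "openbookqa": {"acc": 38.40, "acc_norm": 48.40},
    "piqa": {"acc": 82.75, "acc_norm": 84.28},
    "winogrande": {"acc": 78.53},
}

print(f"AGIEval: {suite_average(agieval):.2f}")  # AGIEval: 44.48
print(f"GPT4All: {suite_average(gpt4all):.2f}")  # GPT4All: 75.84
```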

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 50.67 | ± 1.75 |
| | | mc2 | 66.15 | ± 1.48 |

Average: 66.15%

Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 56.84 | ± 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 66.67 | ± 2.46 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 40.70 | ± 3.06 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 24.79 | ± 2.28 |
| | | exact_str_match | 10.58 | ± 1.63 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 31.00 | ± 2.07 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.00 | ± 1.59 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 58.00 | ± 2.85 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 45.80 | ± 2.23 |
| bigbench_navigate | 0 | multiple_choice_grade | 52.10 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 69.55 | ± 1.03 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 48.88 | ± 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 30.96 | ± 1.46 |
| bigbench_snarks | 0 | multiple_choice_grade | 73.48 | ± 3.29 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 74.14 | ± 1.40 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 42.70 | ± 1.56 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 23.60 | ± 1.20 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 18.40 | ± 0.93 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 58.00 | ± 2.85 |

Average: 46.59%

Average score: 58.27%

🧩 Merge Configuration

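Spaetzle-v69-7b is a merge of the following models using LazyMergekit:

  • cstr/Spaetzle-v68-7b
  • abideen/AlphaMonarch-dora

In total, the merge tree involves the original models listed under base_model in the metadata above.

The configuration for this last merge: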

models:
  - model: cstr/Spaetzle-v68-7b
    # no parameters necessary for base model
  - model: abideen/AlphaMonarch-dora
    parameters:
      density: 0.60
      weight: 0.30
merge_method: dare_ties
base_model: cstr/Spaetzle-v68-7b
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
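To give a rough intuition for what `density` and `weight` control in a DARE-style merge, here is a toy sketch on flat lists of floats. It is not mergekit's implementation. The function name `dare_merge` and the example values are hypothetical, and the sketch deliberately ignores details such as weight normalization, `int8_mask`, and the TIES sign election, which only matters when several fine-tuned models disagree. The idea shown is: take the difference between the fine-tuned and base weights, randomly keep each entry with probability `density`, rescale the survivors by `1 / density`, and add the result to the base scaled by `weight`.

```python
import random


def dare_merge(base, finetuned, density, weight, seed=0):
    """Toy DARE step: drop-and-rescale the task vector, then add it to the base.

    Illustrative only; real merges operate on tensors inside mergekit.
    """
    rng = random.Random(seed)
    merged = []
    for b, f in zip(base, finetuned):
        delta = f - b  # task vector entry
        if rng.random() < density:
            delta = delta / density  # rescale survivors to keep the expected value
        else:
            delta = 0.0  # dropped entry
        merged.append(b + weight * delta)
    return merged


# Made-up weights, with the density and weight from the config above.
base = [0.10, -0.20, 0.30, 0.05]
finetuned = [0.16, -0.26, 0.33, 0.11]
print(dare_merge(base, finetuned, density=0.60, weight=0.30))
```

With `density=1.0` nothing is dropped and the result reduces to plain task arithmetic, `base + weight * (finetuned - base)`.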

💻 Usage

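# Notebook syntax; in a shell, drop the leading "!".
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/Spaetzle-v69-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Render the messages with the tokenizer's chat template, ending with an open assistant turn.
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision and let accelerate place it on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample up to 256 new tokens; generated_text contains the prompt followed by the completion.
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])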