---
license: apache-2.0
tags:
  - merge
  - mergekit
  - epfl-llm/meditron-70b
  - allenai/tulu-2-dpo-70b
model-index:
  - name: Medmerge-tulu-70b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 67.41
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 87.46
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 70.1
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 47.89
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 83.43
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 56.56
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
---

# Medmerge-tulu-70b

Medmerge-tulu-70b is a DARE-TIES merge of the following models on a NousResearch/Llama-2-70b-hf base (see the configuration below):

* [wanglab/ClinicalCamel-70B](https://huggingface.co/wanglab/ClinicalCamel-70B)
* [epfl-llm/meditron-70b](https://huggingface.co/epfl-llm/meditron-70b)
* [allenai/tulu-2-dpo-70b](https://huggingface.co/allenai/tulu-2-dpo-70b)

## Open LLM Leaderboard


| Model Name        | ARC   | HellaSwag | MMLU  | TruthfulQA | Winogrande | GSM8K |
|-------------------|-------|-----------|-------|------------|------------|-------|
| tulu-2-dpo-70b    | 72.1  | 88.99     | 69.84 | 65.78      | 83.27      | 62.62 |
| Medmerge-tulu-70b | 67.41 | 87.46     | 70.1  | 47.89      | 83.43      | 56.56 |

## Performance

Medmerge-tulu-70b demonstrates competitive performance on medical benchmarks.

Table: Five-shot performance of Medmerge-tulu-70b, Clinical Camel-70B, GPT3.5, GPT4, and Med-PaLM 2 on various medical datasets

| Dataset                    | Medmerge-tulu-70b | ClinicalCamel-70B | GPT3.5 | GPT4 | Med-PaLM 2 |
|----------------------------|-------------------|-------------------|--------|------|------------|
| MMLU Anatomy               | 66.6              | 65.2              | 60.7   | 80.0 | 77.8       |
| MMLU Clinical Knowledge    | 72.0              | 72.8              | 68.7   | 86.4 | 88.3       |
| MMLU College Biology       | 84.7              | 81.2              | 72.9   | 93.8 | 94.4       |
| MMLU College Medicine      | 64.2              | 68.2              | 63.6   | 76.3 | 80.9       |
| MMLU Medical Genetics      | 76.0              | 69.0              | 68.0   | 92.0 | 90.0       |
| MMLU Professional Medicine | 75.7              | 75.0              | 69.8   | 93.8 | 95.2       |
| MedMCQA                    | -                 | 54.2              | 51.0   | 72.4 | 71.3       |
| MedQA (USMLE)              | -                 | 60.7              | 53.6   | 81.4 | 79.7       |
| PubMedQA                   | -                 | 77.9              | 60.2   | 74.4 | 79.2       |
| USMLE Sample Exam          | -                 | 64.3              | 58.5   | 86.6 | -          |
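
The MMLU rows above are standard 5-shot evaluations, so they can be spot-checked locally. A minimal sketch using EleutherAI's lm-evaluation-harness follows; the `lm_eval` CLI invocation, the subtask names, and the batch size are assumptions about the harness (v0.4+), not details from this card.

```python
# Sketch: spot-check a few 5-shot MMLU medical subsets with lm-evaluation-harness
# (assumes `pip install lm-eval`). Task names follow the harness's usual MMLU
# naming and are an assumption; adjust to your installed version.
import subprocess

subprocess.run(
    [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=Technoculture/Medmerge-tulu-70b,dtype=bfloat16",
        "--tasks", "mmlu_anatomy,mmlu_clinical_knowledge,mmlu_college_medicine",
        "--num_fewshot", "5",
        "--batch_size", "1",
    ],
    check=True,
)
```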

## 🧩 Configuration

```yaml
models:
  - model: NousResearch/Llama-2-70b-hf
    # no parameters necessary for base model
  - model: wanglab/ClinicalCamel-70B
    parameters:
      weight: 0.08
      density: 0.45
  - model: epfl-llm/meditron-70b
    parameters:
      weight: 0.08
      density: 0.45
  - model: allenai/tulu-2-dpo-70b
    parameters:
      weight: 0.08
      density: 0.45
merge_method: dare_ties
base_model: NousResearch/Llama-2-70b-hf
parameters:
  int8_mask: true
dtype: bfloat16
```
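
In the DARE-TIES method, `density` is the fraction of each model's delta weights (relative to the base) that is retained, and `weight` scales that model's contribution to the merge. A minimal sketch for reproducing the merge with the mergekit CLI, assuming the YAML above is saved as `config.yaml` and `pip install mergekit` has been run; the output path and shard size are illustrative:

```python
# Sketch: run the merge above with mergekit's CLI.
import subprocess

subprocess.run(
    [
        "mergekit-yaml", "config.yaml", "./Medmerge-tulu-70b",  # config and output dir (illustrative)
        "--copy-tokenizer",        # copy the base model's tokenizer into the output
        "--lazy-unpickle",         # lower peak memory while loading checkpoint shards
        "--out-shard-size", "1B",  # write ~1B-parameter output shards
    ],
    check=True,
)
```

Note that mergekit downloads all four source checkpoints, so a 70B merge like this needs several hundred gigabytes of free disk space.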

## 💻 Usage

```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "Technoculture/Medmerge-tulu-70b"
messages = [{"role": "user", "content": "I am feeling sleepy these days"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
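
The float16 pipeline above needs on the order of 140 GB of GPU memory for a 70B model. A possible alternative for smaller setups is 4-bit loading via bitsandbytes; the quantization settings below are illustrative choices, not part of this card:

```python
# Sketch: 4-bit quantized loading for smaller GPUs (assumes `pip install bitsandbytes`).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "Technoculture/Medmerge-tulu-70b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype is an illustrative choice
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs/CPU
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "I am feeling sleepy these days"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```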

## Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric                            | Value |
|-----------------------------------|-------|
| Avg.                              | 68.81 |
| AI2 Reasoning Challenge (25-Shot) | 67.41 |
| HellaSwag (10-Shot)               | 87.46 |
| MMLU (5-Shot)                     | 70.10 |
| TruthfulQA (0-shot)               | 47.89 |
| Winogrande (5-shot)               | 83.43 |
| GSM8k (5-shot)                    | 56.56 |