---
license: apache-2.0
tags:
  - merge
  - mergekit
  - epfl-llm/meditron-70b
  - allenai/tulu-2-dpo-70b
model-index:
  - name: Medmerge-tulu-70b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 67.41
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 87.46
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 70.1
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 47.89
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 83.43
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 56.56
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/Medmerge-tulu-70b
          name: Open LLM Leaderboard
---

# Medmerge-tulu-70b

Medmerge-tulu-70b is a DARE-TIES merge of the following models on a NousResearch/Llama-2-70b-hf base (see the configuration below):

* [wanglab/ClinicalCamel-70B](https://huggingface.co/wanglab/ClinicalCamel-70B)
* [epfl-llm/meditron-70b](https://huggingface.co/epfl-llm/meditron-70b)
* [allenai/tulu-2-dpo-70b](https://huggingface.co/allenai/tulu-2-dpo-70b)

## Open LLM Leaderboard


| Model Name        | ARC   | HellaSwag | MMLU  | TruthfulQA | Winogrande | GSM8K |
|-------------------|-------|-----------|-------|------------|------------|-------|
| tulu-2-dpo-70b    | 72.1  | 88.99     | 69.84 | 65.78      | 83.27      | 62.62 |
| Medmerge-tulu-70b | 67.41 | 87.46     | 70.1  | 47.89      | 83.43      | 56.56 |

## Performance

Medmerge-tulu-70b demonstrates competitive performance on medical benchmarks.

Table: Five-shot performance of Medmerge-tulu-70b, Clinical Camel-70B, GPT3.5, GPT4, and Med-PaLM 2 on various medical datasets

| Dataset                    | Medmerge-tulu-70b | ClinicalCamel-70B | GPT3.5 | GPT4 | Med-PaLM 2 |
|----------------------------|-------------------|-------------------|--------|------|------------|
| MMLU Anatomy               | 66.6              | 65.2              | 60.7   | 80.0 | 77.8       |
| MMLU Clinical Knowledge    | 72.0              | 72.8              | 68.7   | 86.4 | 88.3       |
| MMLU College Biology       | 84.7              | 81.2              | 72.9   | 93.8 | 94.4       |
| MMLU College Medicine      | 64.2              | 68.2              | 63.6   | 76.3 | 80.9       |
| MMLU Medical Genetics      | 76.0              | 69.0              | 68.0   | 92.0 | 90.0       |
| MMLU Professional Medicine | 75.7              | 75.0              | 69.8   | 93.8 | 95.2       |
| MedMCQA                    | -                 | 54.2              | 51.0   | 72.4 | 71.3       |
| MedQA (USMLE)              | -                 | 60.7              | 53.6   | 81.4 | 79.7       |
| PubMedQA                   | -                 | 77.9              | 60.2   | 74.4 | 79.2       |
| USMLE Sample Exam          | -                 | 64.3              | 58.5   | 86.6 | -          |
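
The MMLU rows above are standard 5-shot evaluations, so they can be spot-checked locally. A minimal sketch using EleutherAI's lm-evaluation-harness follows; the `lm_eval` CLI invocation, the subtask names, and the batch size are assumptions about the harness (v0.4+), not details from this card.

```python
# Sketch: spot-check a few 5-shot MMLU medical subsets with lm-evaluation-harness
# (assumes `pip install lm-eval`). Task names follow the harness's usual MMLU
# naming and are an assumption; adjust to your installed version.
import subprocess

subprocess.run(
    [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=Technoculture/Medmerge-tulu-70b,dtype=bfloat16",
        "--tasks", "mmlu_anatomy,mmlu_clinical_knowledge,mmlu_college_medicine",
        "--num_fewshot", "5",
        "--batch_size", "1",
    ],
    check=True,
)
```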

## 🧩 Configuration

```yaml
models:
  - model: NousResearch/Llama-2-70b-hf
    # no parameters necessary for base model
  - model: wanglab/ClinicalCamel-70B
    parameters:
      weight: 0.08
      density: 0.45
  - model: epfl-llm/meditron-70b
    parameters:
      weight: 0.08
      density: 0.45
  - model: allenai/tulu-2-dpo-70b
    parameters:
      weight: 0.08
      density: 0.45
merge_method: dare_ties
base_model: NousResearch/Llama-2-70b-hf
parameters:
  int8_mask: true
dtype: bfloat16
```
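
In the DARE-TIES method, `density` is the fraction of each model's delta weights (relative to the base) that is retained, and `weight` scales that model's contribution to the merge. A minimal sketch for reproducing the merge with the mergekit CLI, assuming the YAML above is saved as `config.yaml` and `pip install mergekit` has been run; the output path and shard size are illustrative:

```python
# Sketch: run the merge above with mergekit's CLI.
import subprocess

subprocess.run(
    [
        "mergekit-yaml", "config.yaml", "./Medmerge-tulu-70b",  # config and output dir (illustrative)
        "--copy-tokenizer",        # copy the base model's tokenizer into the output
        "--lazy-unpickle",         # lower peak memory while loading checkpoint shards
        "--out-shard-size", "1B",  # write ~1B-parameter output shards
    ],
    check=True,
)
```

Note that mergekit downloads all four source checkpoints, so a 70B merge like this needs several hundred gigabytes of free disk space.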

## 💻 Usage

```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "Technoculture/Medmerge-tulu-70b"
messages = [{"role": "user", "content": "I am feeling sleepy these days"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
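
The float16 pipeline above needs on the order of 140 GB of GPU memory for a 70B model. A possible alternative for smaller setups is 4-bit loading via bitsandbytes; the quantization settings below are illustrative choices, not part of this card:

```python
# Sketch: 4-bit quantized loading for smaller GPUs (assumes `pip install bitsandbytes`).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "Technoculture/Medmerge-tulu-70b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype is an illustrative choice
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs/CPU
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "I am feeling sleepy these days"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```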

## Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric                            | Value |
|-----------------------------------|-------|
| Avg.                              | 68.81 |
| AI2 Reasoning Challenge (25-Shot) | 67.41 |
| HellaSwag (10-Shot)               | 87.46 |
| MMLU (5-Shot)                     | 70.10 |
| TruthfulQA (0-shot)               | 47.89 |
| Winogrande (5-shot)               | 83.43 |
| GSM8k (5-shot)                    | 56.56 |