# Model Card for MedGENIE-fid-flan-t5-base-medqa
MedGENIE comprises a collection of language models designed to use generated contexts, rather than retrieved ones, to answer multiple-choice open-domain questions in the medical domain. Specifically, MedGENIE-fid-flan-t5-base-medqa is a fusion-in-decoder (FiD) model based on flan-t5-base, trained on the MedQA-USMLE dataset and grounded in artificial contexts generated by PMC-LLaMA-13B. The model achieves new state-of-the-art (SOTA) performance on the corresponding test set.
## Model description
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: google/flan-t5-base
- Repository: https://github.com/disi-unibo-nlp/medgenie
- Paper: To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering
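Fusion-in-decoder models encode each (question, context) pair independently and let the decoder attend over the concatenation of all encoder outputs. A minimal sketch of the per-passage input formatting is below; the `question:` / `context:` prompt fields follow the common FiD convention and are an assumption for illustration, not taken verbatim from the MedGENIE repository:

```python
def build_fid_inputs(question, options, contexts):
    """Format one input string per generated passage.

    FiD encodes each passage separately; the decoder then fuses the
    concatenated encoder outputs ("fusion in decoder").
    NOTE: the exact prompt template is an assumption for illustration;
    see the official repository for the format used in training.
    """
    # Append the multiple-choice options to the question text.
    opts = " ".join(f"({key}) {text}" for key, text in sorted(options.items()))
    return [f"question: {question} {opts} context: {ctx}" for ctx in contexts]

inputs = build_fid_inputs(
    "Which vitamin deficiency causes scurvy?",
    {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    [
        "Scurvy results from a lack of ascorbic acid...",
        "Vitamin C is required for collagen synthesis...",
    ],
)
# One encoder input per generated context (n_context=5 at training time).
```

At training time five such passages are built per question (see `n_context` under the hyperparameters below), and all five encoder outputs are concatenated before decoding the answer.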
## Performance
At the time of release (February 2024), MedGENIE-fid-flan-t5-base-medqa is the new lightweight SOTA model on the MedQA-USMLE benchmark:
Model | Ground (Source) | Learning | Params | Accuracy (↓) |
---|---|---|---|---|
MedGENIE-FID-Flan-T5 | G (PMC-LLaMA) | Fine-tuned | 250M | 53.1 |
Codex (Liévin et al.) | ∅ | 0-shot | 175B | 52.5 |
Codex (Liévin et al.) | R (Wikipedia) | 0-shot | 175B | 52.5 |
GPT-3.5-Turbo (Yang et al.) | R (Wikipedia) | k-shot | -- | 52.3 |
MEDITRON (Chen et al.) | ∅ | Fine-tuned | 7B | 52.0 |
BioMistral DARE (Labrak et al.) | ∅ | Fine-tuned | 7B | 51.1 |
BioMistral (Labrak et al.) | ∅ | Fine-tuned | 7B | 50.6 |
Zephyr-β | R (MedWiki) | 2-shot | 7B | 50.4 |
BioMedGPT (Luo et al.) | ∅ | k-shot | 10B | 50.4 |
BioMedLM (Singhal et al.) | ∅ | Fine-tuned | 2.7B | 50.3 |
PMC-LLaMA (awq 4 bit) | ∅ | Fine-tuned | 13B | 50.2 |
LLaMA-2 (Chen et al.) | ∅ | Fine-tuned | 7B | 49.6 |
Zephyr-β | ∅ | 2-shot | 7B | 49.6 |
Zephyr-β (Chen et al.) | ∅ | 3-shot | 7B | 49.2 |
PMC-LLaMA (Chen et al.) | ∅ | Fine-tuned | 7B | 49.2 |
DRAGON (Yasunaga et al.) | R (UMLS) | Fine-tuned | 360M | 47.5 |
InstructGPT (Liévin et al.) | R (Wikipedia) | 0-shot | 175B | 47.3 |
BioMistral DARE (Labrak et al.) | ∅ | 3-shot | 7B | 47.0 |
Flan-PaLM (Singhal et al.) | ∅ | 5-shot | 62B | 46.1 |
InstructGPT (Liévin et al.) | ∅ | 0-shot | 175B | 46.0 |
VOD (Liévin et al. 2023) | R (MedWiki) | Fine-tuned | 220M | 45.8 |
Vicuna 1.3 (Liévin et al.) | ∅ | 0-shot | 33B | 45.2 |
BioLinkBERT (Singhal et al.) | ∅ | Fine-tuned | 340M | 45.1 |
Mistral-Instruct | R (MedWiki) | 2-shot | 7B | 45.1 |
BioMistral (Labrak et al.) | ∅ | 3-shot | 7B | 44.4 |
Galactica | ∅ | 0-shot | 120B | 44.4 |
LLaMA-2 (Liévin et al.) | ∅ | 0-shot | 70B | 43.4 |
BioReader (Frisoni et al.) | R (PubMed-RCT) | Fine-tuned | 230M | 43.0 |
Guanaco (Liévin et al.) | ∅ | 0-shot | 33B | 42.9 |
LLaMA-2-chat (Liévin et al.) | ∅ | 0-shot | 70B | 42.3 |
Vicuna 1.5 (Liévin et al.) | ∅ | 0-shot | 65B | 41.6 |
Mistral-Instruct (Chen et al.) | ∅ | 3-shot | 7B | 41.1 |
PaLM (Singhal et al.) | ∅ | 5-shot | 62B | 40.9 |
Guanaco (Liévin et al.) | ∅ | 0-shot | 65B | 40.8 |
Falcon-Instruct (Liévin et al.) | ∅ | 0-shot | 40B | 39.0 |
Vicuna 1.3 (Liévin et al.) | ∅ | 0-shot | 13B | 38.7 |
GreaseLM (Zhang et al.) | R (UMLS) | Fine-tuned | 359M | 38.5 |
PubMedBERT (Singhal et al.) | ∅ | Fine-tuned | 110M | 38.1 |
QA-GNN (Yasunaga et al.) | R (UMLS) | Fine-tuned | 360M | 38.0 |
LLaMA-2 (Yang et al.) | R (Wikipedia) | k-shot | 13B | 37.6 |
LLaMA-2-chat | R (MedWiki) | 2-shot | 7B | 37.2 |
LLaMA-2-chat | ∅ | 2-shot | 7B | 37.2 |
BioBERT (Lee et al.) | ∅ | Fine-tuned | 110M | 36.7 |
MTP-Instruct (Liévin et al.) | ∅ | 0-shot | 30B | 35.1 |
GPT-Neo (Singhal et al.) | ∅ | Fine-tuned | 2.5B | 33.3 |
LLaMa-2-chat (Liévin et al.) | ∅ | 0-shot | 13B | 32.2 |
LLaMa-2 (Liévin et al.) | ∅ | 0-shot | 13B | 31.1 |
GPT-NeoX (Liévin et al.) | ∅ | 0-shot | 20B | 26.9 |
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- n_context: 5
- per_gpu_batch_size: 1
- accumulation_steps: 4
- total_steps: 40,712
- eval_freq: 10,178
- optimizer: AdamW
- scheduler: linear
- weight_decay: 0.01
- warmup_ratio: 0.1
- text_maxlength: 1024
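A few derived quantities follow directly from these values (a sketch; the single-GPU setting is an assumption, as the card only lists a per-GPU batch size):

```python
# Hyperparameters listed above.
per_gpu_batch_size = 1
accumulation_steps = 4
total_steps = 40_712
warmup_ratio = 0.1
eval_freq = 10_178

# Effective batch size per optimizer update (assuming a single GPU).
effective_batch = per_gpu_batch_size * accumulation_steps  # 4

# Linear scheduler: LR warms up over the first 10% of training steps.
warmup_steps = int(warmup_ratio * total_steps)  # 4071

# Number of evaluation passes triggered during training.
n_evals = total_steps // eval_freq  # 4
```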
## Bias, Risks and Limitations
Our model is trained on artificially generated contextual documents, which may inadvertently amplify inherent biases and depart from clinical and societal norms, potentially leading to the spread of convincing medical misinformation. To mitigate this risk, we recommend a cautious approach: domain experts should manually review any output before real-world use. This safeguard is crucial to prevent the dissemination of erroneous or misleading information, particularly within clinical and scientific settings.
## Citation
If you find MedGENIE-fid-flan-t5-base-medqa useful in your work, please cite it with:
```bibtex
@misc{frisoni2024generate,
      title={To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering},
      author={Giacomo Frisoni and Alessio Cocchieri and Alex Presepi and Gianluca Moro and Zaiqiao Meng},
      year={2024},
      eprint={2403.01924},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```