Model Card for MedGENIE-fid-flan-t5-base-medqa

MedGENIE is a collection of language models designed to use generated contexts, rather than retrieved ones, to answer multiple-choice open-domain questions in the medical field. Specifically, MedGENIE-fid-flan-t5-base-medqa is a fusion-in-decoder (FID) model based on flan-t5-base, trained on the MedQA-USMLE dataset and grounded on artificial contexts generated by PMC-LLaMA-13B. This model achieves new state-of-the-art (SOTA) performance on the corresponding test set.
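As a rough illustration of the fusion-in-decoder idea, the sketch below pairs one question with each generated context separately, which is how FID readers typically receive their inputs: each passage is encoded on its own and the decoder then attends over all encoded passages jointly. The function name `build_fid_inputs`, the `question: ... options: ... context: ...` template, and the toy question are assumptions made for illustration; the prompt format actually used to train this model may differ.

```python
from typing import Dict, List


def build_fid_inputs(
    question: str,
    options: Dict[str, str],
    contexts: List[str],
    n_context: int = 5,
) -> List[str]:
    """Return one input string per context, each repeating the question.

    Hypothetical template: the real training prompt may be different.
    """
    # Render the answer options in a stable (alphabetical) order.
    opts = " ".join(f"({key}) {text}" for key, text in sorted(options.items()))
    query = f"question: {question} options: {opts}"
    # Keep at most n_context passages; each one is encoded independently.
    return [f"{query} context: {ctx}" for ctx in contexts[:n_context]]


# Toy example (made up for illustration, not taken from MedQA-USMLE).
example = build_fid_inputs(
    question="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    contexts=[
        "Scurvy results from a lack of ascorbic acid.",
        "Vitamin C is required for collagen synthesis.",
    ],
)
for passage in example:
    print(passage)
```

Because the question is repeated in every passage, the encoder cost grows linearly with the number of contexts, while the decoder can still combine evidence across all of them.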

Model description

Performance

At the time of release (February 2024), MedGENIE-fid-flan-t5-base-medqa is a new lightweight SOTA model on the MedQA-USMLE benchmark:

| Model | Ground (Source) | Learning | Params | Accuracy (%) |
|---|---|---|---|---|
| MedGENIE-FID-Flan-T5 | G (PMC-LLaMA) | Fine-tuned | 250M | 53.1 |
| Codex (Liévin et al.) | | 0-shot | 175B | 52.5 |
| Codex (Liévin et al.) | R (Wikipedia) | 0-shot | 175B | 52.5 |
| GPT-3.5-Turbo (Yang et al.) | R (Wikipedia) | k-shot | -- | 52.3 |
| MEDITRON (Chen et al.) | | Fine-tuned | 7B | 52.0 |
| BioMistral DARE (Labrak et al.) | | Fine-tuned | 7B | 51.1 |
| BioMistral (Labrak et al.) | | Fine-tuned | 7B | 50.6 |
| Zephyr-β | R (MedWiki) | 2-shot | 7B | 50.4 |
| BioMedGPT (Luo et al.) | | k-shot | 10B | 50.4 |
| BioMedLM (Singhal et al.) | | Fine-tuned | 2.7B | 50.3 |
| PMC-LLaMA (awq 4 bit) | | Fine-tuned | 13B | 50.2 |
| LLaMA-2 (Chen et al.) | | Fine-tuned | 7B | 49.6 |
| Zephyr-β | | 2-shot | 7B | 49.6 |
| Zephyr-β (Chen et al.) | | 3-shot | 7B | 49.2 |
| PMC-LLaMA (Chen et al.) | | Fine-tuned | 7B | 49.2 |
| DRAGON (Yasunaga et al.) | R (UMLS) | Fine-tuned | 360M | 47.5 |
| InstructGPT (Liévin et al.) | R (Wikipedia) | 0-shot | 175B | 47.3 |
| BioMistral DARE (Labrak et al.) | | 3-shot | 7B | 47.0 |
| Flan-PaLM (Singhal et al.) | | 5-shot | 62B | 46.1 |
| InstructGPT (Liévin et al.) | | 0-shot | 175B | 46.0 |
| VOD (Liévin et al. 2023) | R (MedWiki) | Fine-tuned | 220M | 45.8 |
| Vicuna 1.3 (Liévin et al.) | | 0-shot | 33B | 45.2 |
| BioLinkBERT (Singhal et al.) | | Fine-tuned | 340M | 45.1 |
| Mistral-Instruct | R (MedWiki) | 2-shot | 7B | 45.1 |
| BioMistral (Labrak et al.) | | 3-shot | 7B | 44.4 |
| Galactica | | 0-shot | 120B | 44.4 |
| LLaMA-2 (Liévin et al.) | | 0-shot | 70B | 43.4 |
| BioReader (Frisoni et al.) | R (PubMed-RCT) | Fine-tuned | 230M | 43.0 |
| Guanaco (Liévin et al.) | | 0-shot | 33B | 42.9 |
| LLaMA-2-chat (Liévin et al.) | | 0-shot | 70B | 42.3 |
| Vicuna 1.5 (Liévin et al.) | | 0-shot | 65B | 41.6 |
| Mistral-Instruct (Chen et al.) | | 3-shot | 7B | 41.1 |
| PaLM (Singhal et al.) | | 5-shot | 62B | 40.9 |
| Guanaco (Liévin et al.) | | 0-shot | 65B | 40.8 |
| Falcon-Instruct (Liévin et al.) | | 0-shot | 40B | 39.0 |
| Vicuna 1.3 (Liévin et al.) | | 0-shot | 13B | 38.7 |
| GreaseLM (Zhang et al.) | R (UMLS) | Fine-tuned | 359M | 38.5 |
| PubMedBERT (Singhal et al.) | | Fine-tuned | 110M | 38.1 |
| QA-GNN (Yasunaga et al.) | R (UMLS) | Fine-tuned | 360M | 38.0 |
| LLaMA-2 (Yang et al.) | R (Wikipedia) | k-shot | 13B | 37.6 |
| LLaMA-2-chat | R (MedWiki) | 2-shot | 7B | 37.2 |
| LLaMA-2-chat | | 2-shot | 7B | 37.2 |
| BioBERT (Lee et al.) | | Fine-tuned | 110M | 36.7 |
| MPT-Instruct (Liévin et al.) | | 0-shot | 30B | 35.1 |
| GPT-Neo (Singhal et al.) | | Fine-tuned | 2.5B | 33.3 |
| LLaMA-2-chat (Liévin et al.) | | 0-shot | 13B | 32.2 |
| LLaMA-2 (Liévin et al.) | | 0-shot | 13B | 31.1 |
| GPT-NeoX (Liévin et al.) | | 0-shot | 20B | 26.9 |

G = generated contexts, R = retrieved contexts; a blank Ground cell means no grounding source is listed. Rows are sorted by descending accuracy.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • n_context: 5
  • per_gpu_batch_size: 1
  • accumulation_steps: 4
  • total_steps: 40,712
  • eval_freq: 10,178
  • optimizer: AdamW
  • scheduler: linear
  • weight_decay: 0.01
  • warmup_ratio: 0.1
  • text_maxlength: 1024
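The values above imply a few derived quantities, worked out in the sketch below: the number of examples per optimizer update on each GPU, the number of warmup steps, and the number of evaluations. It also shows a common form of linear schedule with warmup. The function `linear_lr` is illustrative only; it assumes the schedule is indexed by the same step counter as `total_steps` and decays to zero, which may not match the training code exactly. The number of GPUs is not stated, so only per-GPU figures are computed.

```python
# Hyperparameters copied from the list above.
LEARNING_RATE = 5e-5
PER_GPU_BATCH_SIZE = 1
ACCUMULATION_STEPS = 4
TOTAL_STEPS = 40_712
EVAL_FREQ = 10_178
WARMUP_RATIO = 0.1

# Examples contributing to one optimizer update on a single GPU.
examples_per_update_per_gpu = PER_GPU_BATCH_SIZE * ACCUMULATION_STEPS

# Steps spent ramping the learning rate up from zero.
warmup_steps = int(TOTAL_STEPS * WARMUP_RATIO)

# total_steps is an exact multiple of eval_freq.
num_evaluations = TOTAL_STEPS // EVAL_FREQ


def linear_lr(step: int) -> float:
    """Linear warmup followed by linear decay to zero (assumed form)."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - warmup_steps)


print(examples_per_update_per_gpu, warmup_steps, num_evaluations)
print(linear_lr(0), linear_lr(warmup_steps), linear_lr(TOTAL_STEPS))
```

Under these assumptions the learning rate peaks at 5e-05 after roughly the first tenth of training, and the model is evaluated four times over the run.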

Bias, Risks, and Limitations

Our model is trained on artificially generated contextual documents, which might inadvertently magnify inherent biases and depart from clinical and societal norms. This could lead to the spread of convincing medical misinformation. To mitigate this risk, we recommend a cautious approach: domain experts should manually review any output before real-world use. This ethical safeguard is crucial to prevent the dissemination of potentially erroneous or misleading information, particularly within clinical and scientific circles.

Citation

If you find MedGENIE-fid-flan-t5-base-medqa useful in your work, please cite it with:

@misc{frisoni2024generate,
      title={To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering}, 
      author={Giacomo Frisoni and Alessio Cocchieri and Alex Presepi and Gianluca Moro and Zaiqiao Meng},
      year={2024},
      eprint={2403.01924},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}