VetBERT Disease Syndrome Classifier

This is a finetuned version of the VetBERT model, designed to classify the disease syndrome within a veterinary clinical note.

This pretrained model is designed for performing NLP tasks related to veterinary clinical notes. The Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes (Hur et al., BioNLP 2020) paper introduced VetBERT model: an initialized Bert Model with ClinicalBERT (Bio+Clinical BERT) and further pretrained on the VetCompass Australia corpus for performing tasks specific to veterinary medicine.

Pretraining Data

The VetBERT model was initialized from Bio_ClinicalBERT model, which was initialized from BERT. The VetBERT model was trained on over 15 million veterinary clincal Records and 1.3 Billion tokens.

Pretraining Hyperparameters

During the pretraining phase for VetBERT, we used a batch size of 32, a maximum sequence length of 512, and a learning rate of 5 · 10−5. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15 and max predictions per sequence = 20).

VetBERT Finetuning

VetBERT was further finetuned on a set of 5002 annotated clinical notes to classifiy the disease syndrome associated with the clinical notes as outlined in the paper: Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes

How to use the model

Load the model via the transformers library:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and model from the Hugging Face Hub
model_name = 'havocy28/VetBERTDx'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text to classify
text = "Hx: 7 yo canine with history of vomiting intermittently since yesterday. No other concerns. Still eating and drinking normally. cPL negative."

# Encode the text and prepare inputs for the model
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Predict and compute softmax to get probabilities
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.softmax(logits, dim=-1)

# Retrieve label mapping from model's configuration
label_map = model.config.id2label

# Combine labels and probabilities, and sort by probability in descending order
sorted_probs = sorted(((prob.item(), label_map[idx]) for idx, prob in enumerate(probabilities[0])), reverse=True, key=lambda x: x[0])

# Display sorted probabilities and labels
for prob, label in sorted_probs:
    print(f"{label}: {prob:.4f}")

Citation

Please cite this article: Brian Hur, Timothy Baldwin, Karin Verspoor, Laura Hardefeldt, and James Gilkerson. 2020. Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 156–166, Online. Association for Computational Linguistics.