Project Description

This repository contains the trained model for our paper "Fine-tuning a Sentence Transformer for DNA & Protein tasks", which is currently under review at BMC Bioinformatics. The model, called simcse-dna, is based on the original implementation of SimCSE [1]. The original model was adapted for DNA downstream tasks by training it on a small sample of k-mer tokens generated from the human reference genome, and it can be used to generate sentence embeddings for DNA tasks.
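
As a rough illustration of what k-mer tokens look like, the sketch below splits a DNA string into overlapping 6-mers and joins them into a space-separated "sentence". The helper names (kmer_tokenize, to_sentence), the stride of 1, and the space-joined format are assumptions made for this example; the paper's exact preprocessing may differ.

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers using a sliding window (assumed scheme)."""
    sequence = sequence.upper()
    # Sequences shorter than k yield an empty list.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def to_sentence(sequence: str, k: int = 6, stride: int = 1) -> str:
    """Join the k-mers with spaces so a tokenizer sees one 'sentence'."""
    return " ".join(kmer_tokenize(sequence, k, stride))


# Toy sequences for illustration only; they are not taken from the paper's data.
sentences = [to_sentence("ACGTACGTAC"), to_sentence("TTGACCATGA")]
print(sentences[0])  # ACGTAC CGTACG GTACGT TACGTA ACGTAC
```

A list built this way could be passed as sentences in the Usage section, provided the format matches what the model was trained on.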

Prerequisites


Please see the original SimCSE for installation details. The model will also be hosted on Zenodo (DOI: 10.5281/zenodo.11046580).

Usage

Run the following code to get the sentence embeddings:


import torch
from transformers import AutoModel, AutoTokenizer

# Import trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dsfsi/simcse-dna")
model = AutoModel.from_pretrained("dsfsi/simcse-dna")


# sentences is your list of n DNA k-mer tokens of size 6 (define it before running this line)
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

The retrieved embeddings can be utilized as input for a machine learning classifier to perform classification.
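
The sketch below shows one way this step might look with scikit-learn, using a Random Forest with 5-fold cross-validation and a held-out test split. It is not the paper's evaluation code: the random arrays stand in for real embeddings and labels, the 768-column width is an assumed embedding size, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Placeholder data (assumption): replace X with embeddings.numpy() and y with
# your real class labels. Random features carry no signal, so scores near
# chance level are expected here.
X = rng.normal(size=(200, 768))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation on the training split only.
cv_acc = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")

# Fit on the full training split, then score once on the held-out test split.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)

print(f"CV accuracy: {cv_acc.mean():.2f}")
print(f"Test accuracy: {test_acc:.2f}, Test F1: {test_f1:.2f}")
```

Keeping the test split out of cross-validation is what makes the two scores separately meaningful.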

Performance on evaluation tasks

More details about the datasets and how to access them are given in the paper (TBA).
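
The task headings below state that results were obtained after oversampling, but not which method was used. As a hedged illustration only, the following sketch implements plain random oversampling, which duplicates minority-class samples until every class matches the largest one. The function name random_oversample and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy training split for illustration: four samples of class 0, one of class 1.
train_x = ["s1", "s2", "s3", "s4", "s5"]
train_y = [0, 0, 0, 0, 1]
bal_x, bal_y = random_oversample(train_x, train_y)
print(Counter(bal_y))
```

Oversampling is normally applied to the training data only, so that duplicated samples do not leak into the test set.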

Task 1: Detection of colorectal cancer cases (after oversampling)

Model            5-fold Cross Validation accuracy    Test accuracy
LightGBM         91                                  63
Random Forest    94                                  71
XGBoost          93                                  66
CNN              42                                  52

Model            5-fold Cross Validation F1          Test F1
LightGBM         91                                  66
Random Forest    94                                  72
XGBoost          93                                  66
CNN              41                                  60

Task 2: Prediction of the Gleason grade group (after oversampling)

Model            5-fold Cross Validation accuracy    Test accuracy
LightGBM         97                                  68
Random Forest    98                                  78
XGBoost          97                                  70
CNN              35                                  50

Model            5-fold Cross Validation F1          Test F1
LightGBM         97                                  70
Random Forest    98                                  80
XGBoost          97                                  70
CNN              33                                  59

Task 3: Detection of human TATA sequences (after oversampling)

Model            5-fold Cross Validation accuracy    Test accuracy
LightGBM         98                                  93
Random Forest    99                                  96
XGBoost          99                                  95
CNN              38                                  59

Model            5-fold Cross Validation F1          Test F1
LightGBM         98                                  92
Random Forest    99                                  95
XGBoost          99                                  92
CNN              58                                  10

Authors


  • Mpho Mokoatle, Vukosi Marivate, Darlington Mapiye, Riana Bornman, Vanessa M. Hayes
  • Contact details: [email protected]

Citation


BibTeX reference TBA

References

[1] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv preprint arXiv:2104.08821 (2021).
