Model Card for DistilBERT Sentiment Analysis on IMDB
This model card describes a sentiment analysis model fine-tuned on the IMDB movie reviews dataset. The model uses a distilled version of BERT (DistilBERT) and is designed to classify reviews as either "positive" or "negative." It is intended for research and prototyping purposes.
Model Details
Model Description
This model is a fine-tuned version of the pre-trained `distilbert-base-uncased` model from Hugging Face. It has been adapted to perform binary sentiment classification (positive vs. negative) using the IMDB movie reviews dataset. Fine-tuning uses the Hugging Face Trainer API with custom evaluation metrics to optimize performance on the sentiment analysis task.
- Developed by: Aygün Varol
- Model type: DistilBERT-based sentiment classifier
- Language(s) (NLP): English
- License: MIT License
- Finetuned from model: `distilbert-base-uncased`
Model Sources
- Repository: Link to GitHub repository
- Paper: Sanh et al., 2019, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter"
- Demo: Link to a demo
Uses
Direct Use
This model can be used directly for sentiment analysis of English text. Given an input review, it predicts whether the sentiment is positive or negative. It is suitable for prototyping applications such as movie review analysis, social media sentiment monitoring, or customer feedback analysis.
Downstream Use
For applications that require additional customization (e.g., domain-specific sentiment analysis), this model can serve as a starting point for further fine-tuning or integration into larger natural language processing pipelines.
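As a minimal sketch of that workflow (assuming the `datasets` library is installed; `yelp_polarity` stands in for a domain-specific labeled dataset, and the hyperparameters are illustrative, not part of this model's original training):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Start from this model rather than the base DistilBERT checkpoint.
model_name = "Aygun/finetuned_distilbert_imdb"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Placeholder dataset: substitute your own labeled domain data
# (expects "text" and "label" columns).
dataset = load_dataset("yelp_polarity", split="train[:2000]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domain_finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset,
)
trainer.train()
```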
Out-of-Scope Use
- Non-English text: The model is fine-tuned on English reviews and may not perform well on text in other languages.
- Multiclass sentiment analysis: This model is binary (positive/negative) and is not designed for more granular sentiment or emotion detection.
- Highly specialized domains: The IMDB dataset represents movie reviews, so performance might degrade on texts from very different domains without further fine-tuning.
Bias, Risks, and Limitations
The IMDB dataset, like many public datasets, may contain inherent biases that can be reflected in the model predictions. For example:
- Data Bias: The training data may overrepresent certain types of reviews or sentiments.
- Generalization: The model may not generalize well to other domains (e.g., product reviews, social media posts) without additional fine-tuning.
- Ethical Risks: Misclassification in critical applications (e.g., automated moderation) may lead to undesired consequences. Users should evaluate the model carefully before deployment.
Recommendations
Users should:
- Be aware of the biases present in the training data.
- Consider further fine-tuning on domain-specific datasets.
- Evaluate the model's performance on their specific data before using it in production environments.
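For the last recommendation, a minimal evaluation sketch (the texts and labels below are illustrative placeholders; the label-string comparison assumes the model returns human-readable labels, which depends on the `id2label` mapping saved with the model):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="Aygun/finetuned_distilbert_imdb")

# Illustrative placeholders; substitute a labeled sample from your own domain.
texts = ["Great product, works as advertised.", "Terrible support, would not recommend."]
labels = ["positive", "negative"]

predictions = classifier(texts)
# Note: label strings depend on the model's saved id2label mapping
# (e.g. they may be "LABEL_0"/"LABEL_1"); adjust the comparison if so.
correct = sum(pred["label"].lower() == gold for pred, gold in zip(predictions, labels))
print(f"Accuracy on sample: {correct / len(labels):.2f}")
```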
How to Get Started with the Model
To use the model, install the `transformers` library and load it as follows:

```python
from transformers import pipeline

# Load the fine-tuned model through the sentiment-analysis pipeline.
sentiment_classifier = pipeline("sentiment-analysis", model="Aygun/finetuned_distilbert_imdb")

result = sentiment_classifier("This movie was fantastic!")
print(result)
```
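The pipeline returns a list of dictionaries of the form `[{'label': ..., 'score': ...}]`. The exact label strings depend on the `id2label` mapping saved with the model; if no custom mapping was stored, they appear as the generic `LABEL_0`/`LABEL_1`.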
Training Details
Training Data
The model was fine-tuned on the IMDB movie reviews dataset, which contains 50,000 movie reviews split evenly between positive and negative classes.
Training Procedure
The model was fine-tuned using the Hugging Face Trainer API with the following setup (a minimal sketch follows the list):
- Preprocessing: Reviews were tokenized with truncation and padding to a maximum sequence length of 256 tokens.
- Hyperparameters: Learning rate of 5e-5, batch size of 16, and a single epoch (for demonstration purposes). For full training, more epochs and possibly different hyperparameters should be considered.
- Evaluation: Performance was measured with accuracy, precision, recall, and F1-score.
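The following sketch reproduces that setup under stated assumptions: the metric implementation uses scikit-learn and is a plausible reconstruction of the custom evaluation metrics mentioned above, not the exact training script.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# IMDB ships as 25,000 training and 25,000 test reviews, balanced by class.
imdb = load_dataset("imdb")

def tokenize(batch):
    # Truncate/pad to the 256-token maximum described above.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_ds = imdb["train"].map(tokenize, batched=True)
test_ds = imdb["test"].map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

args = TrainingArguments(
    output_dir="finetuned_distilbert_imdb",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```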