WatiBERT: Fine-Tuned BERT Model for French Rap Lyrics

Overview

WatiBERT is a BERT model fine-tuned on French rap lyrics sourced from Genius. The dataset weighs 323 MB, corresponding to 85M tokens after tokenization.

This model is designed to understand and analyze semantic relationships within the context of French rap, providing a valuable tool for research on French slang and music-lyrics analysis.

Model Details

The model is based on the FlauBERT Large Cased architecture and has been fine-tuned with the following hyperparameters:

Parameter         Value
Epochs            5
Train Batch Size  16
Learning Rate     2e-5
Weight Decay      0.01
Warmup Ratio      0.1
Dropout           0.1
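
For reference, here is a minimal sketch of how these hyperparameters could be expressed with the Hugging Face Trainer API. The original training script is not published, so the base checkpoint name and train_dataset are illustrative placeholders:

from transformers import (
    FlaubertTokenizer,
    FlaubertWithLMHeadModel,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

# Assumed base checkpoint (the card only states "FlauBERT Large Cased")
tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_large_cased")
model = FlaubertWithLMHeadModel.from_pretrained("flaubert/flaubert_large_cased")

# Hyperparameters from the table above; dropout (0.1) is FlauBERT's default
# and is set in the model config rather than in TrainingArguments.
training_args = TrainingArguments(
    output_dir="watibert",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
)

# Masked-language-modeling collator applies random masking on the fly
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: the tokenized lyrics corpus is not distributed
    data_collator=data_collator,
)
trainer.train()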

Versions

The model was trained using AWS SageMaker on a single ml.p3.2xlarge instance with the following software versions:

Requirement           Version
Transformers Library  4.6
PyTorch               1.7
Python                3.6

Installation

Install Required Python Libraries:

pip install transformers torch

Loading the Model

To load the WatiBERT model, use the following Python code:

from transformers import FlaubertTokenizer, FlaubertWithLMHeadModel

# Load the tokenizer and model
tokenizer = FlaubertTokenizer.from_pretrained("rapminerz/WatiBERT-large-cased")
model = FlaubertWithLMHeadModel.from_pretrained("rapminerz/WatiBERT-large-cased")

Using the Model

BERT models are masked language models, so you can try the model by filling in masked words (FlauBERT's mask token is <special1>):

import torch

def fill_mask(sentence, topk):
    # Tokenize the sentence and locate the position of the mask token (<special1>)
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    # Run the model and keep the logits for the masked position
    outputs = model(**inputs)
    logits = outputs.logits
    # Return the top-k most likely tokens for the first masked position
    top_tokens_ids = logits[0, mask_token_index, :].topk(topk, dim=1).indices[0]
    top_tokens = [tokenizer.decode(token_id) for token_id in top_tokens_ids]
    return top_tokens

sentence = "La <special1> est morte hier, ils faisaient pas le poids (gang)"
fill_mask(sentence, 1)
['concurrence']

sentence = "On s'en souviendra comme le coup de tête de <special1>..."
fill_mask(sentence, 1)
['Zidane']

sentence = "Et quand je serai en haut j'achêterai une <special1> à ma daronne !"
fill_mask(sentence, 1)
['villa']

sentence = "Tout ce qui m'importe c'est faire du <special1> !"
fill_mask(sentence, 5)
['chiffre', 'cash', 'fric', 'sale', 'blé']

Usages

This model can then be fine-tuned for several tasks such as text classification, named entity recognition, question answering, text summarization, text generation, text completion, paraphrasing, language translation, and sentiment analysis.
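
As an illustration, here is a minimal sketch of adapting WatiBERT for text classification with FlaubertForSequenceClassification; num_labels=2 and the example sentence are placeholders, not part of the released model:

from transformers import FlaubertTokenizer, FlaubertForSequenceClassification

# Reuse the WatiBERT weights and add a fresh (untrained) classification head.
# num_labels=2 is an illustrative choice, e.g. positive/negative sentiment.
tokenizer = FlaubertTokenizer.from_pretrained("rapminerz/WatiBERT-large-cased")
model = FlaubertForSequenceClassification.from_pretrained(
    "rapminerz/WatiBERT-large-cased", num_labels=2
)

inputs = tokenizer("Ce son est une dinguerie", return_tensors="pt")
logits = model(**inputs).logits  # meaningless until the head is fine-tuned on labeled data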

Purpose and Disclaimer

This model is designed for academic and research purposes only. It is not intended for commercial use. The creators of this model do not endorse or promote any specific views or opinions that may be represented in the dataset.

Please mention @RapMinerz if you use our models.

Contact

For any questions or issues, please contact the repository owner, RapMinerz, at [email protected].
