Fine-tuned XLM-RoBERTa for Toxicity Classification in Spanish

This model is a fine-tuned version of XLM-RoBERTa. The base model is the base-sized XLM-RoBERTa, pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. The training dataset is a gold standard for toxicity and incivility in Spanish-language interactions around protest events.

The dataset comprises ~5M data points from three Latin American protest events: (a) protests against the coronavirus and judicial reform measures in Argentina during August 2020; (b) protests against education budget cuts in Brazil in May 2019; and (c) the social outburst in Chile stemming from protests against the underground fare hike in October 2019. We focus on interactions in Spanish to develop a gold standard for digital interactions in this language; we therefore prioritise Argentinian and Chilean data.

Labels: NONTOXIC and TOXIC.

Example of Classification

## Pipeline as a high-level helper
from transformers import pipeline
toxic_classifier = pipeline("text-classification", model="bgonzalezbustamante/ft-xlm-roberta-toxicity")

## Non-toxic example
non_toxic = toxic_classifier("Que tengas un excelente día :)")

## Toxic example
toxic = toxic_classifier("Eres un maldito infeliz")

## Print examples
print(non_toxic)
print(toxic)

Output:

[{'label': 'NONTOXIC', 'score': 0.5529471635818481}]
[{'label': 'TOXIC', 'score': 0.6219274401664734}]
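
The scores above are close to 0.5, so downstream code may want a stricter cut-off than the top label alone. The following is a minimal sketch of such post-processing, not part of the model itself. The helper names `toxic_probability` and `is_toxic` and the threshold values are hypothetical. It also assumes the two scores come from a softmax over exactly two labels, so that the score of the other label is one minus the reported score.

```python
from typing import Dict, List, Union

Prediction = List[Dict[str, Union[str, float]]]


def toxic_probability(prediction: Prediction) -> float:
    """Return the score assigned to TOXIC for a single top-1 prediction.

    Assumption: scores are a softmax over the two labels NONTOXIC and TOXIC,
    so the TOXIC score is 1 - score when the top label is NONTOXIC.
    """
    top = prediction[0]
    score = float(top["score"])
    return score if top["label"] == "TOXIC" else 1.0 - score


def is_toxic(prediction: Prediction, threshold: float = 0.5) -> bool:
    """Flag a text as toxic only if its TOXIC score reaches the threshold."""
    return toxic_probability(prediction) >= threshold


# The two predictions shown in the output above
non_toxic = [{"label": "NONTOXIC", "score": 0.5529471635818481}]
toxic = [{"label": "TOXIC", "score": 0.6219274401664734}]

for name, pred in [("non_toxic", non_toxic), ("toxic", toxic)]:
    print(name, round(toxic_probability(pred), 3), is_toxic(pred, threshold=0.6))
```

Raising the threshold trades recall for precision; the value 0.6 here is only illustrative and would need to be tuned on held-out data.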

Validation Metrics

  • Accuracy: 0.740
  • Precision: 0.688
  • Recall: 0.924
  • F1-Score: 0.789
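
As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two values listed above. The sketch below uses a hypothetical helper `f1_from_precision_recall`; it only reproduces the arithmetic and does not re-evaluate the model.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the validation metrics above
precision = 0.688
recall = 0.924

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.789"
```

The recomputed value matches the reported F1-score. Recall being well above precision suggests the classifier misses few toxic messages but flags some non-toxic ones as toxic.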

Model size: 278M params (Safetensors, tensor type F32)