---
license: apache-2.0
language:
- ar
pipeline_tag: text-classification
datasets:
- labr
widget:
- text: من أفضل الكتب التي قرأتها في هذا العام
  example_title: Positive
- text: الكتاب سيء، لا أنصح أحد بقراءته أبدا
  example_title: Negative
- text: لا يمكنك الجزم بشيء حول هذا الكتاب
  example_title: Neutral
metrics:
- precision
- recall
- f1
library_name: transformers
tags:
- code
- sentiment analysis
- sentiment-analysis
---

# Introduction

This model predicts whether the sentiment of a text is Positive, Neutral, or Negative. It is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co./UBC-NLP/MARBERTv2) on [labr](https://huggingface.co./datasets/labr).

# Data

The data used is [labr](https://huggingface.co./datasets/labr), an Arabic book reviews dataset. The sentiment label is derived from the number of stars given in each review.

| Number of stars | Sentiment |
|-----------------|-----------|
| 1-2             | Negative  |
| 3               | Neutral   |
| 4-5             | Positive  |

# Training

Starting from the Arabic pre-trained [MARBERTv2](https://huggingface.co./UBC-NLP/MARBERTv2) as a base, we fine-tuned the model for a classification task. Training ran for 3 epochs using the Hugging Face Trainer on Google Colab. This is a proof-of-concept experiment, so the training hyper-parameters were not optimized.

# Evaluation

The model was evaluated on the test set of [labr](https://huggingface.co./datasets/labr), using the same preprocessing steps as in training. Please note that the following results are macro averages.

| Metric    | Score |
|-----------|-------|
| Precision | 0.663 |
| Recall    | 0.662 |
| F1        | 0.66  |

# Using the model

To use the model in your code, follow the Hugging Face instructions, or:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="AbdallahNasir/book-review-sentiment-classification")
result = pipe("من أفضل الكتب التي قرأتها في هذا العام")
print(result)
```
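The pipeline returns a list of dictionaries, each with a predicted label and a confidence score. As a minimal sketch (not part of the original training or evaluation scripts), the snippet below shows one way to reproduce a macro-averaged report on the labr test set with scikit-learn; it assumes the published model config maps class ids to the names Positive / Neutral / Negative (adjust if the pipeline returns LABEL_0-style ids) and reuses the star-to-sentiment mapping from the training code below.

```python
import datasets
from sklearn.metrics import classification_report
from transformers import pipeline

pipe = pipeline("text-classification", model="AbdallahNasir/book-review-sentiment-classification")

# Same star-to-sentiment mapping used for training (labr "label" values are 0-4).
# Assumes the model config maps ids to "Positive"/"Neutral"/"Negative";
# adjust if the pipeline returns LABEL_0-style ids instead.
rate_to_sentiment = {0: "Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Positive"}

test_set = datasets.load_dataset("labr", split="test")
references = [rate_to_sentiment[label] for label in test_set["label"]]

# Truncate long reviews to the model's 512-token limit, as in training.
predictions = [p["label"] for p in pipe(test_set["text"], truncation=True, max_length=512, batch_size=32)]

# Prints per-class and macro-averaged precision, recall, and F1.
print(classification_report(references, predictions, digits=3))
```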
# Training code

Following this code, you will get the same results I got. You can run it in Google Colab; please use a GPU runtime to finish the training quickly.

```python
# Notebook only:
!pip install transformers[torch] datasets

# Download and load the data
import datasets

dataset = datasets.load_dataset("labr")

# Transform the ratings into sentiment labels
POSITIVE = "Positive"
NEUTRAL = "Neutral"
NEGATIVE = "Negative"

rate_to_sentiment = {0: NEGATIVE, 1: NEGATIVE, 2: NEUTRAL, 3: POSITIVE, 4: POSITIVE}

dataset = dataset.map(lambda example: {"sentiment": rate_to_sentiment[example["label"]]}, remove_columns=["label"])
dataset = dataset.rename_column("sentiment", "label")

class_names = [POSITIVE, NEUTRAL, NEGATIVE]
num_classes = len(class_names)
dataset = dataset.cast_column("label", datasets.ClassLabel(num_classes=num_classes, names=class_names))

# Download and load the pre-trained model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/MARBERTv2", num_labels=3)

# Tokenize the data for training
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, return_length=True, return_attention_mask=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=False, num_proc=16)

# Define the data collator, which pads each batch dynamically
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define the training arguments and the trainer, evaluating at the end of each epoch
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Train and save
trainer.train()
trainer.save_model("final_output")
```

##### Keywords

* sentiment analysis
* arabic
* book reviews