Arabic Flair + fastText Part-of-Speech tagging Model (Egyptian and Levant)

Pretrained part-of-speech tagging model built on a joint corpus written in Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, with code-switching between Egyptian Arabic and English. The model is trained using Flair (forward + backward) and fastText embeddings.

Pretraining Corpora:

This sequence labeling model was pretrained on three corpora jointly:

  1. 4 Dialects: a set of dialectal Arabic datasets covering four dialects of Arabic: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). Each dataset consists of 350 manually segmented and PoS-tagged tweets.
  2. UD South Levantine Arabic MADAR: a dataset of 100 manually annotated sentences taken from the MADAR (Multi-Arabic Dialect Applications and Resources) project, by Shorouq Zahra.
  3. Parts of the Cairo Students Code-Switch (CSCS) corpus, developed for "Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus" by Hamed et al.

Usage

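from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pretrained tagger from the Hugging Face model hub.
tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")
# Wrap the raw text in a Sentence; Flair tokenizes it on construction.
sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')
# Predict in place: the labels are attached to the sentence object.
tagger.predict(sentence)
# Print each predicted part-of-speech label.
for entity in sentence.get_spans('pos'):
    print(entity)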

Because right-to-left text is embedded in a left-to-right context, some formatting errors might occur, and your code might appear like this (link accessed on 2020-10-27).
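One way to reduce such display problems in an editor is to keep each Arabic token on its own line and join the tokens afterwards, so that no line mixes a long right-to-left run with left-to-right code. The following is only a sketch of that idea; the names `tokens` and `text` are illustrative and not part of Flair.

```python
# Illustrative only: list the example sentence one token per line so each
# line holds a single short right-to-left run.
tokens = [
    "عمرو",
    "عادلي",
    "أستاذ",
    "للاقتصاد",
    "السياسي",
    "المساعد",
    "في",
    "الجامعة",
    "الأمريكية",
    "بالقاهرة",
    ".",
]

# Join with single spaces to rebuild the whitespace-separated input text.
text = " ".join(tokens)
print(len(tokens), "tokens")
```

The resulting `text` can then be passed to `Sentence(...)` in place of the inline string literal used in the usage snippet.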

Scores & Tagset

Tag precision recall f1-score support
INTJ 0.8182 0.9000 0.8571 10
NOUN 0.9009 0.9402 0.9201 435
NUM 0.9524 0.8333 0.8889 24
ADJ 0.8762 0.7603 0.8142 121
ADP 0.9903 0.9623 0.9761 106
CCONJ 0.9600 0.9730 0.9664 74
PROPN 0.9333 0.9333 0.9333 15
ADV 0.9135 0.8051 0.8559 118
VERB 0.8852 0.9231 0.9038 117
PRON 0.9620 0.9465 0.9542 187
SCONJ 0.8571 0.9474 0.9000 19
PART 0.9350 0.9791 0.9565 191
DET 0.9348 0.9149 0.9247 47
PUNCT 1.0000 1.0000 1.0000 35
AUX 0.9286 0.9811 0.9541 53
MENTION 0.9231 1.0000 0.9600 12
V 0.8571 0.8780 0.8675 82
FUT-PART+V+PREP+PRON 1.0000 0.0000 0.0000 1
PROG-PART+V+PRON+PREP+PRON 0.0000 1.0000 0.0000 0
ADJ+NSUFF 0.6111 0.8462 0.7097 26
NOUN+NSUFF 0.8182 0.8438 0.8308 64
PREP+PRON 0.9565 0.9565 0.9565 23
PUNC 0.9941 1.0000 0.9971 169
EOS 1.0000 1.0000 1.0000 70
NOUN+PRON 0.6986 0.8500 0.7669 60
V+PRON 0.7258 0.8036 0.7627 56
PART+PRON 1.0000 0.9474 0.9730 19
PROG-PART+V 0.8333 0.9302 0.8791 43
DET+NOUN 0.9625 1.0000 0.9809 77
NOUN+NSUFF+PRON 0.9091 0.7143 0.8000 14
PROG-PART+V+PRON 0.7083 0.9444 0.8095 18
PREP+NOUN+NSUFF 0.6667 0.4000 0.5000 5
NOUN+NSUFF+NSUFF 1.0000 0.0000 0.0000 3
CONJ 0.9722 1.0000 0.9859 35
V+PRON+PRON 0.6364 0.5833 0.6087 12
FOREIGN 0.6667 0.6667 0.6667 3
PREP+NOUN 0.6316 0.7500 0.6857 16
DET+NOUN+NSUFF 0.9000 0.9310 0.9153 29
DET+ADJ+NSUFF 1.0000 0.5714 0.7273 7
CONJ+PRON 1.0000 0.8750 0.9333 8
NOUN+CASE 0.0000 0.0000 0.0000 2
DET+ADJ 1.0000 0.6667 0.8000 6
PREP 1.0000 0.9718 0.9857 71
CONJ+FUT-PART+V 0.0000 0.0000 0.0000 1
CONJ+V 0.6667 0.7500 0.7059 8
FUT-PART 1.0000 1.0000 1.0000 2
ADJ+PRON 1.0000 0.0000 0.0000 8
CONJ+PREP+NOUN+PRON 1.0000 0.0000 0.0000 1
CONJ+NOUN+PRON 0.3750 1.0000 0.5455 3
PART+ADJ 1.0000 0.0000 0.0000 1
PART+NOUN 0.5000 1.0000 0.6667 1
CONJ+PREP+NOUN 1.0000 0.0000 0.0000 1
CONJ+NOUN 0.7000 0.7778 0.7368 9
URL 1.0000 1.0000 1.0000 3
CONJ+FUT-PART 1.0000 0.0000 0.0000 1
FUT-PART+V 0.8571 0.6000 0.7059 10
PREP+NOUN+NSUFF+NSUFF 1.0000 0.0000 0.0000 1
HASH 1.0000 0.9412 0.9697 17
ADJ+PREP+PRON 1.0000 0.0000 0.0000 3
PREP+NOUN+PRON 0.0000 0.0000 0.0000 1
EMOT 1.0000 0.8889 0.9412 18
CONJ+PREP 1.0000 0.7500 0.8571 4
PREP+DET+NOUN+NSUFF 1.0000 0.7500 0.8571 4
PRON+DET+NOUN+NSUFF 0.0000 1.0000 0.0000 0
V+PREP+PRON 1.0000 0.0000 0.0000 5
V+PRON+PREP+PRON 0.0000 1.0000 0.0000 0
CONJ+NOUN+NSUFF 0.5000 0.5000 0.5000 2
V+NEG-PART 1.0000 0.0000 0.0000 2
PREP+DET+NOUN 0.9091 1.0000 0.9524 10
PREP+V 1.0000 0.0000 0.0000 2
CONJ+PART 1.0000 0.7778 0.8750 9
CONJ+V+PRON 1.0000 1.0000 1.0000 5
PROG-PART+V+PREP+PRON 1.0000 0.5000 0.6667 2
PREP+NOUN+NSUFF+PRON 1.0000 1.0000 1.0000 1
ADJ+CASE 1.0000 0.0000 0.0000 1
PART+NOUN+PRON 1.0000 1.0000 1.0000 1
PART+V 1.0000 0.0000 0.0000 3
PART+V+PRON 0.0000 1.0000 0.0000 0
FUT-PART+V+PRON 0.0000 1.0000 0.0000 0
FUT-PART+V+PRON+PRON 1.0000 0.0000 0.0000 1
CONJ+PREP+PRON 1.0000 0.0000 0.0000 1
CONJ+V+PRON+PREP+PRON 1.0000 0.0000 0.0000 1
CONJ+V+PREP+PRON 0.0000 1.0000 0.0000 0
CONJ+DET+NOUN+NSUFF 1.0000 0.0000 0.0000 1
CONJ+DET+NOUN 0.6667 1.0000 0.8000 2
CONJ+PREP+DET+NOUN 1.0000 1.0000 1.0000 1
PREP+PART 1.0000 0.0000 0.0000 2
PART+V+PRON+NEG-PART 0.3333 0.3333 0.3333 3
PART+V+NEG-PART 0.3333 0.5000 0.4000 2
PART+PREP+NEG-PART 1.0000 1.0000 1.0000 3
PART+PROG-PART+V+NEG-PART 1.0000 0.3333 0.5000 3
PREP+DET+NOUN+NSUFF+PREP+PRON 1.0000 0.0000 0.0000 1
PREP+PRON+DET+NOUN 0.0000 1.0000 0.0000 0
PART+NSUFF 1.0000 0.0000 0.0000 1
CONJ+PROG-PART+V+PRON 1.0000 1.0000 1.0000 1
PART+PREP+PRON 1.0000 0.0000 0.0000 1
CONJ+PART+PREP 1.0000 0.0000 0.0000 1
NUM+NSUFF 0.6667 0.6667 0.6667 3
CONJ+PART+V+PRON+NEG-PART 1.0000 1.0000 1.0000 1
PART+NOUN+NEG-PART 1.0000 1.0000 1.0000 1
CONJ+ADJ+NSUFF 1.0000 0.0000 0.0000 1
PREP+ADJ 1.0000 0.0000 0.0000 1
ADJ+NSUFF+PRON 1.0000 0.0000 0.0000 2
CONJ+PROG-PART+V 1.0000 0.0000 0.0000 1
CONJ+PART+PROG-PART+V+PREP+PRON+NEG-PART 1.0000 0.0000 0.0000 1
CONJ+PART+PREP+PRON+NEG-PART 0.0000 1.0000 0.0000 0
PREP+PART+PRON 1.0000 0.0000 0.0000 1
CONJ+ADV+NSUFF 1.0000 0.0000 0.0000 1
CONJ+ADV 0.0000 1.0000 0.0000 0
PART+NOUN+PRON+NEG-PART 0.0000 1.0000 0.0000 0
CONJ+ADJ 1.0000 1.0000 1.0000 1
  • F-score (micro): 0.8974
  • F-score (macro): 0.5188
  • Accuracy (incl. no class): 0.901
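
The macro F-score is far lower than the micro F-score most likely because macro averaging gives every tag equal weight, so the many rare compound tags that score 0.0000 pull it down. The sketch below illustrates the effect on four rows copied from the score table. The support-weighted mean is used here only as a stand-in for a frequency-sensitive average (it is not the micro F-score itself), and the function names are illustrative.

```python
# Four rows copied from the score table: (tag, f1-score, support).
rows = [
    ("NOUN", 0.9201, 435),
    ("PUNC", 0.9971, 169),
    ("ADJ+PRON", 0.0000, 8),
    ("V+PREP+PRON", 0.0000, 5),
]


def macro_f1(rows):
    """Unweighted mean of per-class F1: every tag counts equally."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Support-weighted mean of per-class F1: frequent tags dominate."""
    total = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total


print(f"macro:    {macro_f1(rows):.4f}")
print(f"weighted: {weighted_f1(rows):.4f}")
```

On this subset, the two rare tags halve the macro average while barely moving the weighted one.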

The table above lists the class scores for each tag. Note that compound tags (a single tag covering multiple agglutinated parts of speech) are treated as separate classes.
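
If the individual parts of a compound tag are needed, they can be recovered by splitting on the "+" separator seen in the table. This is a minimal sketch that assumes "+" always separates components and that hyphenated names such as PROG-PART are single components. The helper `split_compound` is hypothetical and not part of Flair or of this model.

```python
def split_compound(tag):
    """Split a compound tag into its component tags on '+'."""
    return tag.split("+")


# Example tags taken from the score table.
for tag in ["NOUN", "DET+NOUN+NSUFF", "CONJ+PROG-PART+V+PRON"]:
    print(tag, "->", split_compound(tag))
```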

Citation

If you use this model, please consider citing this work:

@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
year = {2021},
doi = "10.13140/RG.2.2.34961.10084",
url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}