---
license: cc-by-4.0
datasets:
- dsfsi/vukuzenzele-monolingual
- nchlt
- dsfsi/PuoData
- dsfsi/gov-za-monolingual
language:
- tn
library_name: transformers
pipeline_tag: fill-mask
tags:
- masked language model
- setswana
---
# PuoBERTa: A Curated Setswana Language Model

[![Zenodo doi badge](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.8434795-blue.svg)](https://doi.org/10.5281/zenodo.8434795) [![arXiv](https://img.shields.io/badge/arXiv-2310.09141-b31b1b.svg)](https://arxiv.org/abs/2310.09141) 🤗 [https://huggingface.co./dsfsi/PuoBERTa](https://huggingface.co./dsfsi/PuoBERTa)


Give Feedback 📑: [DSFSI Resource Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/formResponse)

A RoBERTa-based language model designed specifically for Setswana and trained on the new PuoData dataset.

## Model Details


### Model Description

This is a masked language model trained on Setswana corpora. It can serve as a base for a range of downstream applications, from translation to content creation. It is pre-trained on the curated PuoData dataset, which was assembled with accuracy and cultural relevance in mind.

- **Developed by:** Vukosi Marivate ([@vukosi](https://huggingface.co./@vukosi)), Moseli Mots'Oehli ([@MoseliMotsoehli](https://huggingface.co./@MoseliMotsoehli)), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
- **Model type:** RoBERTa Model
- **Language(s) (NLP):** Setswana
- **License:** CC BY 4.0


### Usage

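Use this model to fill in masked tokens, or fine-tune it for downstream tasks. Here is a simple example that loads the model for masked prediction:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the tokenizer and the model together with its masked-language-modelling head.
# (A bare RobertaModel only returns hidden states and cannot score masked tokens.)
tokenizer = AutoTokenizer.from_pretrained("dsfsi/PuoBERTa")
model = AutoModelForMaskedLM.from_pretrained("dsfsi/PuoBERTa")
```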
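To go from a loaded checkpoint to actual predictions, one option is the `fill-mask` pipeline from `transformers`. The sketch below is a minimal illustration, not the authors' evaluation code: the helper names `load_fill_mask` and `top_predictions` are hypothetical, and the `{mask}` placeholder convention is an assumption made here for readability.

```python
from typing import Callable, List, Tuple

MODEL_ID = "dsfsi/PuoBERTa"  # checkpoint named in this model card


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(
    fill_mask: Callable,
    template: str,
    mask_token: str = "<mask>",
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Fill the single `{mask}` slot in `template` and return (token, score) pairs."""
    text = template.format(mask=mask_token)
    # For one input string with one mask, the pipeline returns a list of dicts
    # with the keys "token_str" and "score" (among others).
    results = fill_mask(text, top_k=k)
    return [(r["token_str"].strip(), float(r["score"])) for r in results]
```

With `transformers` installed, a call might look like `fill_mask = load_fill_mask()` followed by `top_predictions(fill_mask, "Leina la me ke {mask}.", fill_mask.tokenizer.mask_token)`. The sentence is only an illustrative prompt ("My name is ..."), and the predictions depend on the checkpoint.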
 
### Downstream Use 

## Downstream Performance

### Daily News Dikgang

Learn more about the dataset in the [Dataset Folder](daily-news-dikgang)

| **Model**                   | **5-fold Cross Validation F1**       | **Test F1**       |
|-----------------------------|--------------------------------------|-------------------|
| Logistic Regression + TFIDF | 60.1                                 | 56.2              |
| NCHLT TSN RoBERTa           | 64.7                                 | 60.3              |
| PuoBERTa                    | **63.8**                             | **62.9**          |
| PuoBERTa+JW300              | 66.2                                 | 65.4              |

Downstream News Categorisation model 🤗 [https://huggingface.co./dsfsi/PuoBERTa-News](https://huggingface.co./dsfsi/PuoBERTa-News)
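
As a rough sketch of how the news categorisation checkpoint could be used, the snippet below wraps the `text-classification` pipeline from `transformers`. It assumes `dsfsi/PuoBERTa-News` loads with that pipeline task. The helper names `load_news_classifier` and `categorise` are hypothetical, and the label names come from the checkpoint's own configuration rather than from this card.

```python
from typing import Callable, Iterable, List, Tuple

NEWS_MODEL_ID = "dsfsi/PuoBERTa-News"  # downstream checkpoint linked above


def load_news_classifier(model_id: str = NEWS_MODEL_ID):
    """Build a text-classification pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so `categorise` can be exercised without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def categorise(classifier: Callable, articles: Iterable[str]) -> List[Tuple[str, float]]:
    """Return the top (label, score) pair for each article, in input order."""
    # For a list of strings the pipeline returns one {"label", "score"} dict per input.
    predictions = classifier(list(articles))
    return [(p["label"], float(p["score"])) for p in predictions]
```

A call such as `categorise(load_news_classifier(), [article_text])` would then return one predicted category per article, where `article_text` stands for any Setswana news text you supply.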

### MasakhaPOS

Performance of models on the MasakhaPOS downstream task.

| Model | Test Performance |
|---|---|
| **Multilingual Models** |  |
| AfroLM | 83.8 |
| AfriBERTa | 82.5 |
| AfroXLMR-base | 82.7 |
| AfroXLMR-large | 83.0 |
| **Monolingual Models** |  |
| NCHLT TSN RoBERTa | 82.3 |
| PuoBERTa | **83.4** |
| PuoBERTa+JW300 | 84.1 |

Downstream POS model 🤗 [https://huggingface.co./dsfsi/PuoBERTa-POS](https://huggingface.co./dsfsi/PuoBERTa-POS)

### MasakhaNER

Performance of models on the MasakhaNER downstream task.

| Model | Test Performance (F1 score) |
|---|---|
| **Multilingual Models** |  |
| AfriBERTa | 83.2 |
| AfroXLMR-base | 87.7 |
| AfroXLMR-large | 89.4 |
| **Monolingual Models** |  |
| NCHLT TSN RoBERTa | 74.2 |
| PuoBERTa | **78.2** |
| PuoBERTa+JW300 | 80.2 |

Downstream NER model 🤗 [https://huggingface.co./dsfsi/PuoBERTa-NER](https://huggingface.co./dsfsi/PuoBERTa-NER)
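
A minimal sketch of running the NER checkpoint, assuming `dsfsi/PuoBERTa-NER` loads with the `token-classification` pipeline from `transformers`. The helpers `load_ner` and `extract_entities` are hypothetical names, and `aggregation_strategy="simple"` is a choice made here to merge word pieces into whole entity spans.

```python
from typing import Callable, List, Tuple

NER_MODEL_ID = "dsfsi/PuoBERTa-NER"  # downstream checkpoint linked above


def load_ner(model_id: str = NER_MODEL_ID):
    """Build a token-classification pipeline that groups sub-word pieces into spans."""
    # Imported lazily so `extract_entities` can be exercised without transformers installed.
    from transformers import pipeline

    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def extract_entities(ner: Callable, text: str, min_score: float = 0.0) -> List[Tuple[str, str]]:
    """Return (span text, entity type) pairs whose score is at least `min_score`."""
    # With an aggregation strategy set, each result dict carries "word",
    # "entity_group" and "score" for one merged entity span.
    return [(e["word"], e["entity_group"]) for e in ner(text) if e["score"] >= min_score]
```

Raising `min_score` trades recall for precision by filtering out low-confidence spans; the default keeps everything the model returns.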

## Pre-Training Dataset

We pre-trained the model on PuoData, a curated dataset of Setswana text, so that the model is grounded in the language and its cultural context.

[GitHub](https://github.com/dsfsi/PuoData), 🤗 [https://huggingface.co./datasets/dsfsi/PuoData](https://huggingface.co./datasets/dsfsi/PuoData)

## Citation Information

BibTeX reference:

```
@inproceedings{marivate2023puoberta,
  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year    = {2023},
  booktitle= {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
  url= {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
  keywords = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co./dsfsi/PuoBERTa}
}
```

## Contributing

Your contributions are welcome! Feel free to improve the model.

## Model Card Authors

Vukosi Marivate

## Model Card Contact

For more details, reach out or check our [website](https://dsfsi.github.io/).

Email: [email protected]

**Enjoy exploring Setswana through AI!**