---
license: cc-by-4.0
datasets:
- dsfsi/vukuzenzele-monolingual
- nchlt
- dsfsi/PuoData
- dsfsi/gov-za-monolingual
language:
- tn
library_name: transformers
pipeline_tag: fill-mask
tags:
- masked language model
- setswana
---
# PuoBERTa: A curated Setswana Language Model
[DOI](https://doi.org/10.5281/zenodo.8434795) [arXiv](https://arxiv.org/abs/2310.09141) 🤗 [https://huggingface.co./dsfsi/PuoBERTa](https://huggingface.co./dsfsi/PuoBERTa)
Give Feedback 📑: [DSFSI Resource Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/formResponse)
A RoBERTa-based language model designed specifically for Setswana and trained on the new PuoData dataset.
## Model Details
### Model Description
This is a masked language model trained on Setswana corpora. It can be used directly for masked-token prediction or fine-tuned for downstream Setswana tasks such as the news categorisation, part-of-speech tagging, and named entity recognition tasks reported below. It is pre-trained on the curated PuoData dataset.
- **Developed by:** Vukosi Marivate ([@vukosi](https://huggingface.co./@vukosi)), Moseli Mots'Oehli ([@MoseliMotsoehli](https://huggingface.co./@MoseliMotsoehli)), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
- **Model type:** RoBERTa Model
- **Language(s) (NLP):** Setswana
- **License:** CC BY 4.0
### Usage
Use this model for filling in masks, or fine-tune it for downstream tasks. Here's a simple example that loads the model for masked prediction:
```python
from transformers import RobertaTokenizer, RobertaForMaskedLM

# Load the model with its masked-language-modelling head, plus the tokenizer
model = RobertaForMaskedLM.from_pretrained('dsfsi/PuoBERTa')
tokenizer = RobertaTokenizer.from_pretrained('dsfsi/PuoBERTa')
```
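Loading the model does not by itself produce predictions. A minimal sketch of masked prediction with the `fill-mask` pipeline from `transformers` might look like the following. The helper names `format_predictions` and `predict_mask`, the `{mask}` placeholder convention, and the default `top_k=5` are illustrative assumptions, not part of the released model.

```python
def format_predictions(predictions):
    """Turn fill-mask pipeline output into (token, score) pairs, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked]


def predict_mask(template, top_k=5):
    """Fill the `{mask}` slot in `template` with the model's top-k guesses.

    `template` is a hypothetical convention: a sentence containing the
    literal text `{mask}` where the missing word should go.
    """
    # Imported lazily so format_predictions works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="dsfsi/PuoBERTa")
    # Use the tokenizer's own mask token instead of hard-coding it.
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    return format_predictions(fill_mask(text, top_k=top_k))
```

Calling `predict_mask` with a Setswana sentence that contains one `{mask}` slot downloads the model on first use and returns a list of candidate tokens with their scores.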
### Downstream Use
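The fine-tuned checkpoints linked in the Downstream Performance section can likely be loaded through the standard `transformers` pipelines. The sketch below is one way to do that; the helper name `load_downstream`, the short keys, and the pairing of each checkpoint with a pipeline task are assumptions made for illustration and are not specified by this card.

```python
# Assumed pipeline task and checkpoint for each fine-tuned model linked in this card.
DOWNSTREAM_MODELS = {
    "news": ("text-classification", "dsfsi/PuoBERTa-News"),
    "pos": ("token-classification", "dsfsi/PuoBERTa-POS"),
    "ner": ("token-classification", "dsfsi/PuoBERTa-NER"),
}


def load_downstream(name):
    """Return a transformers pipeline for one of the fine-tuned checkpoints."""
    if name not in DOWNSTREAM_MODELS:
        raise KeyError(f"unknown model {name!r}; choose from {sorted(DOWNSTREAM_MODELS)}")
    # Imported lazily so the lookup table can be inspected without transformers.
    from transformers import pipeline

    task, checkpoint = DOWNSTREAM_MODELS[name]
    return pipeline(task, model=checkpoint)
```

For example, `load_downstream("ner")` would return a callable that takes Setswana text and yields token-level predictions, provided the checkpoint is compatible with the assumed task.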
## Downstream Performance
### Daily News Dikgang
Learn more about the dataset in the [Dataset Folder](daily-news-dikgang).
| **Model** | **5-fold Cross Validation F1** | **Test F1** |
|-----------------------------|--------------------------------------|-------------------|
| Logistic Regression + TFIDF | 60.1 | 56.2 |
| NCHLT TSN RoBERTa | 64.7 | 60.3 |
| PuoBERTa | **63.8** | **62.9** |
| PuoBERTa+JW300              | 66.2                                 | 65.4              |
Downstream News Categorisation model 🤗 [https://huggingface.co./dsfsi/PuoBERTa-News](https://huggingface.co./dsfsi/PuoBERTa-News)
### MasakhaPOS
Performance of models on the MasakhaPOS downstream task.
| Model | Test Performance |
|---|---|
| **Multilingual Models** | |
| AfroLM | 83.8 |
| AfriBERTa | 82.5 |
| AfroXLMR-base | 82.7 |
| AfroXLMR-large | 83.0 |
| **Monolingual Models** | |
| NCHLT TSN RoBERTa | 82.3 |
| PuoBERTa | **83.4** |
| PuoBERTa+JW300 | 84.1 |
Downstream POS model 🤗 [https://huggingface.co./dsfsi/PuoBERTa-POS](https://huggingface.co./dsfsi/PuoBERTa-POS)
### MasakhaNER
Performance of models on the MasakhaNER downstream task.
| Model | Test Performance (F1 score) |
|---|---|
| **Multilingual Models** | |
| AfriBERTa | 83.2 |
| AfroXLMR-base | 87.7 |
| AfroXLMR-large | 89.4 |
| **Monolingual Models** | |
| NCHLT TSN RoBERTa | 74.2 |
| PuoBERTa | **78.2** |
| PuoBERTa+JW300 | 80.2 |
Downstream NER model 🤗 [https://huggingface.co./dsfsi/PuoBERTa-NER](https://huggingface.co./dsfsi/PuoBERTa-NER)
## Pre-Training Dataset
We used the PuoData dataset, a rich source of Setswana text, ensuring that our model is well-trained and culturally attuned.
[GitHub](https://github.com/dsfsi/PuoData), 🤗 [https://huggingface.co./datasets/dsfsi/PuoData](https://huggingface.co./datasets/dsfsi/PuoData)
## Citation Information
BibTeX reference:
```
@inproceedings{marivate2023puoberta,
title = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
author = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
year = {2023},
booktitle= {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
url= {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
keywords = {NLP},
preprint_url = {https://arxiv.org/abs/2310.09141},
dataset_url = {https://github.com/dsfsi/PuoBERTa},
software_url = {https://huggingface.co./dsfsi/PuoBERTa}
}
```
## Contributing
Your contributions are welcome! Feel free to improve the model.
## Model Card Authors
Vukosi Marivate
## Model Card Contact
For more details, reach out or check our [website](https://dsfsi.github.io/).
Email: [email protected]
**Enjoy exploring Setswana through AI!**