BERT5urk
This repository hosts BERT5urk, a new Turkish T5 model with 1.42B parameters.
BERT5urk is part of the Turkish Model Zoo family and pretrained using the awesome T5X library with the UL2 objective.
Inspired by the great Finnish T5 and UL2 models from the Finnish NLP group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the "Scale Efficiently" paper. Many thanks to the Finnish NLP group for open-sourcing their pretraining code and models!
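For a quick start, the model can be loaded with the Hugging Face transformers library. This is a minimal sketch: the model id is a placeholder, and the UL2 mode prefix is an assumption borrowed from other UL2 models, not a confirmed detail of BERT5urk:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder model id; replace with the actual Hub repository name.
model_id = "bert5urk"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# UL2-style models often expect a mode prefix such as "[NLU]"; whether
# BERT5urk uses such prefixes is an assumption borrowed from other UL2 models.
inputs = tokenizer("[NLU] Türkiye'nin başkenti <extra_id_0>.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```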
Pretraining Data
BERT5urk uses the Turkish part of the amazing FineWeb2 corpus. Only documents with a language score higher than 0.99 are kept for the final pretraining corpus, which has a total size of 262GB.
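The filtering step can be sketched with the Hugging Face datasets library. The dataset and config names below follow the public FineWeb2 release, and the score field is an assumption about how the filtering was done, not the exact pipeline:

```python
from datasets import load_dataset

# Turkish subset of FineWeb2; config name follows the public release (assumption).
ds = load_dataset("HuggingFaceFW/fineweb-2", name="tur_Latn", split="train", streaming=True)

# Keep only documents with a language-identification score above 0.99.
filtered = ds.filter(lambda doc: doc["language_score"] > 0.99)

for doc in filtered.take(3):
    print(doc["text"][:200])
```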
We train an SPM-based (SentencePiece) vocabulary on a 3GB corpus sampled from randomly chosen documents of the pretraining corpus.
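Vocabulary training with SentencePiece could look like the following sketch; the input path, vocabulary size, and unigram model type are assumptions, not the exact settings used:

```python
import sentencepiece as spm

# Train a SentencePiece model on the sampled 3GB corpus.
# vocab_size and model_type are assumptions; the actual settings may differ.
spm.SentencePieceTrainer.train(
    input="sampled_corpus.txt",   # hypothetical path to the 3GB sample
    model_prefix="bert5urk_spm",
    vocab_size=32_000,
    model_type="unigram",
    character_coverage=1.0,
)
```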
Pretraining
BERT5urk was pretrained with the T5X library. Some pretraining highlights:
- One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
- The model was pretrained for 2M steps with an input and output sequence length of 512 and a batch size of 128 (a rough token-count estimate follows below)
- The resulting model has 1.42B parameters
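As a rough scale estimate from the numbers above: 2,000,000 steps × 128 sequences per batch × 512 input tokens ≈ 131B input tokens seen during pretraining (ignoring packing and the UL2 denoiser mix).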
Evaluation
Detailed evaluations can be found in the Turkish Model Zoo repository. Additionally, we fine-tuned TURNA, another Turkish T5 model with 1.14B parameters, for comparison.
Encoder-only Results
For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we use the awesome Flair library and fine-tune only the encoder of BERT5urk and TURNA. The overall performance can be seen in the following table:
Model Name | Overall Development | Overall Test |
---|---|---|
BERTurk (cased, 128k) | 89.72 | 90.05 |
BERTurk (uncased, 128k) | 89.25 | 89.95 |
BERTurk (cased, 32k) | 88.98 | 89.49 |
BERTurk (uncased, 32k) | 89.28 | 89.67 |
ConvBERTurk (cased) | 90.06 | 90.27 |
ConvBERTurk mC4 (cased) | 90.03 | 90.09 |
ConvBERTurk mC4 (uncased) | 89.76 | 89.97 |
DistilBERTurk (cased) | 87.95 | 88.16 |
ELECTRA Base (cased) | 89.08 | 89.91 |
ELECTRA Base mC4 (cased) | 89.24 | 90.03 |
ELECTRA Base mC4 (uncased) | 89.09 | 89.62 |
ELECTRA Small (cased) | 87.27 | 88.28 |
BERT5urk | 89.96 | 90.26 |
TURNA | 88.81 | 89.36 |
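A minimal Flair fine-tuning sketch for this encoder-only setup is shown below. The corpus paths, hyperparameters, and model id are placeholders, and using TransformerWordEmbeddings to pull only the encoder out of an encoder-decoder checkpoint is an assumption about the setup, not the exact training script:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style NER corpus; paths and column layout are placeholders.
columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus(
    "data/", columns,
    train_file="train.txt", dev_file="dev.txt", test_file="test.txt",
)
label_dict = corpus.make_label_dictionary(label_type="ner")

# Word-level embeddings from the (placeholder) BERT5urk checkpoint.
embeddings = TransformerWordEmbeddings(
    model="bert5urk",  # hypothetical model id
    layers="-1",
    fine_tune=True,
)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ner-bert5urk",
    learning_rate=5e-6,  # placeholder hyperparameters
    mini_batch_size=16,
    max_epochs=10,
)
```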
Encoder-decoder Results
We tried to replicate the results from the TURNA paper using the TURNA fine-tuning library.
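For readers without access to that library, the same kind of fine-tuning can be sketched with the generic transformers Seq2SeqTrainer. This is explicitly not the TURNA fine-tuning library's API; all file names, column names, and hyperparameters below are placeholder assumptions:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "bert5urk"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Hypothetical paraphrase dataset with "source" and "target" columns.
dataset = load_dataset("json", data_files={"train": "train.json"})

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bert5urk-paraphrase",
    learning_rate=1e-4,              # placeholder hyperparameters
    per_device_train_batch_size=8,
    num_train_epochs=10,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```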
Paraphrasing - Tatoeba
We fine-tune five models each for TURNA and BERT5urk with different seeds and report the average scores. Additionally, the scores from the TURNA paper are shown in the following table:
Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
---|---|---|---|---|---|
TURNA (paper) | 90.22 | 80.23 | 88.95 | 71.14 | 87.56 |
TURNA (replicated) | 90.36 | 80.50 | 89.10 | 71.48 | 87.63 |
BERT5urk | 90.47 | 80.78 | 89.21 | 71.89 | 87.74 |
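The reported metrics (ROUGE, BLEU, METEOR) can be computed with the Hugging Face evaluate library; a minimal sketch with toy predictions, assuming this matches the metric implementations used for the tables:

```python
import evaluate

predictions = ["bu bir deneme cümlesidir"]  # toy model outputs
references = ["bu bir test cümlesidir"]

rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))
# sacrebleu expects one list of references per prediction.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))
```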
Paraphrasing - OpenSubtitles
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
---|---|---|---|---|---|
TURNA (paper) | 78.43 | 63.58 | 76.81 | 51.47 | 74.79 |
TURNA (replicated) | 78.36 | 63.42 | 76.71 | 51.39 | 74.94 |
BERT5urk | 78.56 | 63.80 | 76.95 | 51.74 | 75.07 |
Title Generation - TrNews
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
---|---|---|---|---|---|
TURNA (paper) | 36.47 | 22.88 | 35.47 | 12.64 | 23.62 |
TURNA (replicated) | 41.65 | 27.60 | 36.77 | 18.60 | 34.55 |
BERT5urk | 41.79 | 27.77 | 37.00 | 19.08 | 34.69 |
Summarization - TrNews
We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report the scores, including the scores from the TURNA paper:
Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
---|---|---|---|---|---|
TURNA (paper) | 41.77 | 27.81 | 36.99 | 19.05 | 34.61 |
TURNA (replicated) | 40.75 | 26.82 | 35.88 | 18.00 | 33.91 |
BERT5urk | 41.00 | 27.08 | 36.24 | 18.78 | 23.96 |
Acknowledgments
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs over many years ❤️
Made from the Bavarian Oberland with ❤️ and 🥨.