opus-mt-tc-bible-big-afa-deu_eng_fra_por_spa

Model Details

Neural machine translation model for translating from Afro-Asiatic languages (afa) to German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using Hugging Face's transformers library. Training data is taken from OPUS and training pipelines use the procedures of OPUS-MT-train.

Model Description

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<<, where id is a valid target language ID, e.g. >>deu<<.
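Since the target-language token must be prepended to every input sentence, it is convenient to do this programmatically. A minimal sketch (the helper name `with_target_token` is illustrative, not part of the released tooling; the set of valid IDs is taken from the model name):

```python
# Valid target-language IDs for this model, per the model name
# (deu+eng+fra+por+spa).
TARGET_LANGS = {"deu", "eng", "fra", "por", "spa"}

def with_target_token(text: str, target: str) -> str:
    """Prepend the sentence-initial >>id<< token expected by the model."""
    if target not in TARGET_LANGS:
        raise ValueError(f"unsupported target language: {target}")
    return f">>{target}<< {text}"

print(with_target_token("Tislit", "deu"))
# >>deu<< Tislit
```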

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with the target-language token (>>eng<<, >>fra<<, ...).
src_text = [
    ">>eng<< Anta i ak-d-yennan ur yerbiḥ ara Tom?",
    ">>fra<< Iselman d aɣbalu axatar i wučči n yemdanen."
]

model_name = "pytorch-models/opus-mt-tc-bible-big-afa-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize with padding so the batch can be processed in one generate() call.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Who told you that he didn't?
#     L'eau est une source importante de nourriture pour les gens.

You can also use OPUS-MT models with the transformers pipeline API, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-afa-deu_eng_fra_por_spa")
print(pipe(">>eng<< Anta i ak-d-yennan ur yerbiḥ ara Tom?"))

# expected output: Who told you that he didn't?

Training

Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Evaluation

| langpair | testset | chr-F | BLEU | #sent | #words |
|----------|---------|------:|-----:|------:|-------:|
| ara-deu | tatoeba-test-v2021-08-07 | 0.61039 | 41.7 | 1209 | 8371 |
| ara-eng | tatoeba-test-v2021-08-07 | 5.430 | 0.0 | 10305 | 76975 |
| ara-fra | tatoeba-test-v2021-08-07 | 0.56120 | 38.8 | 1569 | 11066 |
| ara-spa | tatoeba-test-v2021-08-07 | 0.62567 | 43.7 | 1511 | 9708 |
| heb-deu | tatoeba-test-v2021-08-07 | 0.63131 | 42.4 | 3090 | 25101 |
| heb-eng | tatoeba-test-v2021-08-07 | 0.64960 | 49.2 | 10519 | 77427 |
| heb-fra | tatoeba-test-v2021-08-07 | 0.64348 | 46.3 | 3281 | 26123 |
| heb-por | tatoeba-test-v2021-08-07 | 0.63350 | 43.2 | 719 | 5335 |
| mlt-eng | tatoeba-test-v2021-08-07 | 0.66653 | 51.0 | 203 | 1165 |
| amh-eng | flores101-devtest | 0.47357 | 21.0 | 1012 | 24721 |
| amh-fra | flores101-devtest | 0.43155 | 16.2 | 1012 | 28343 |
| amh-por | flores101-devtest | 0.42109 | 15.1 | 1012 | 26519 |
| ara-deu | flores101-devtest | 0.51110 | 20.4 | 1012 | 25094 |
| ara-fra | flores101-devtest | 0.56934 | 29.7 | 1012 | 28343 |
| ara-por | flores101-devtest | 0.55727 | 28.2 | 1012 | 26519 |
| ara-spa | flores101-devtest | 0.48350 | 19.5 | 1012 | 29199 |
| hau-eng | flores101-devtest | 0.46804 | 21.6 | 1012 | 24721 |
| hau-fra | flores101-devtest | 0.41827 | 15.9 | 1012 | 28343 |
| heb-eng | flores101-devtest | 0.62422 | 36.6 | 1012 | 24721 |
| mlt-eng | flores101-devtest | 0.72390 | 49.1 | 1012 | 24721 |
| mlt-fra | flores101-devtest | 0.60840 | 34.7 | 1012 | 28343 |
| mlt-por | flores101-devtest | 0.59863 | 31.8 | 1012 | 26519 |
| acm-deu | flores200-devtest | 0.48947 | 17.6 | 1012 | 25094 |
| acm-eng | flores200-devtest | 0.56799 | 28.5 | 1012 | 24721 |
| acm-fra | flores200-devtest | 0.53577 | 26.1 | 1012 | 28343 |
| acm-por | flores200-devtest | 0.52441 | 23.9 | 1012 | 26519 |
| acm-spa | flores200-devtest | 0.46985 | 18.2 | 1012 | 29199 |
| amh-deu | flores200-devtest | 0.41553 | 12.6 | 1012 | 25094 |
| amh-eng | flores200-devtest | 0.49333 | 22.5 | 1012 | 24721 |
| amh-fra | flores200-devtest | 0.44890 | 17.8 | 1012 | 28343 |
| amh-por | flores200-devtest | 0.43771 | 16.5 | 1012 | 26519 |
| apc-deu | flores200-devtest | 0.47480 | 16.0 | 1012 | 25094 |
| apc-eng | flores200-devtest | 0.56075 | 28.1 | 1012 | 24721 |
| apc-fra | flores200-devtest | 0.52325 | 24.6 | 1012 | 28343 |
| apc-por | flores200-devtest | 0.51055 | 22.9 | 1012 | 26519 |
| apc-spa | flores200-devtest | 0.45634 | 17.2 | 1012 | 29199 |
| arz-deu | flores200-devtest | 0.45844 | 14.1 | 1012 | 25094 |
| arz-eng | flores200-devtest | 0.52534 | 22.7 | 1012 | 24721 |
| arz-fra | flores200-devtest | 0.50336 | 21.8 | 1012 | 28343 |
| arz-por | flores200-devtest | 0.48741 | 20.0 | 1012 | 26519 |
| arz-spa | flores200-devtest | 0.44516 | 15.8 | 1012 | 29199 |
| hau-eng | flores200-devtest | 0.48137 | 23.4 | 1012 | 24721 |
| hau-fra | flores200-devtest | 0.42981 | 17.2 | 1012 | 28343 |
| hau-por | flores200-devtest | 0.41385 | 15.7 | 1012 | 26519 |
| heb-deu | flores200-devtest | 0.53482 | 22.8 | 1012 | 25094 |
| heb-eng | flores200-devtest | 0.63368 | 38.0 | 1012 | 24721 |
| heb-fra | flores200-devtest | 0.58417 | 32.6 | 1012 | 28343 |
| heb-por | flores200-devtest | 0.57140 | 30.7 | 1012 | 26519 |
| mlt-eng | flores200-devtest | 0.73415 | 51.1 | 1012 | 24721 |
| mlt-fra | flores200-devtest | 0.61626 | 35.8 | 1012 | 28343 |
| mlt-spa | flores200-devtest | 0.50534 | 21.8 | 1012 | 29199 |
| som-eng | flores200-devtest | 0.42764 | 17.7 | 1012 | 24721 |
| tir-por | flores200-devtest | 2.931 | 0.0 | 1012 | 26519 |
| hau-eng | newstest2021 | 0.43744 | 15.5 | 997 | 27372 |
| amh-eng | ntrex128 | 0.42042 | 15.0 | 1997 | 47673 |
| hau-eng | ntrex128 | 0.50349 | 26.1 | 1997 | 47673 |
| hau-fra | ntrex128 | 0.41837 | 15.8 | 1997 | 53481 |
| hau-por | ntrex128 | 0.40851 | 15.3 | 1997 | 51631 |
| hau-spa | ntrex128 | 0.43376 | 18.5 | 1997 | 54107 |
| heb-deu | ntrex128 | 0.49482 | 17.7 | 1997 | 48761 |
| heb-eng | ntrex128 | 0.59241 | 31.3 | 1997 | 47673 |
| heb-fra | ntrex128 | 0.52180 | 24.0 | 1997 | 53481 |
| heb-por | ntrex128 | 0.51248 | 23.2 | 1997 | 51631 |
| mlt-spa | ntrex128 | 0.57078 | 30.9 | 1997 | 54107 |
| som-eng | ntrex128 | 0.49187 | 24.3 | 1997 | 47673 |
| som-fra | ntrex128 | 0.41236 | 15.1 | 1997 | 53481 |
| som-por | ntrex128 | 0.41550 | 15.2 | 1997 | 51631 |
| som-spa | ntrex128 | 0.43278 | 17.6 | 1997 | 54107 |
| tir-eng | tico19-test | 2.655 | 0.0 | 2100 | 56824 |
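The chr-F column above is the character n-gram F-score (chrF). A simplified single-pair sketch of the metric, averaging the F2-score of character n-gram overlaps for n = 1…6; this illustrative implementation omits sacrebleu's whitespace handling and word-n-gram (chrF++) options, so it will not exactly reproduce the table:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF: mean F_beta over character n-gram orders."""
    scores = []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not hyp or not ref:
            continue  # strings shorter than n contribute no n-grams
        matches = sum((hyp & ref).values())  # clipped n-gram matches
        precision = matches / sum(hyp.values())
        recall = matches / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * precision * recall
                      / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0
```

An identical hypothesis and reference score 1.0; a hypothesis sharing no characters with the reference scores 0.0.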

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: a0ea3b3
  • port time: Mon Oct 7 17:08:30 EEST 2024
  • port machine: LM0-400-22516.local