opus-mt-tc-base-gmw-gmw

Neural machine translation model for translating from West Germanic languages (gmw) to West Germanic languages (gmw).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to pyTorch using the transformers library by huggingface. Training data is taken from OPUS and training pipelines use the procedures of OPUS-MT-train.

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Model info

  • Release: 2021-02-23
  • source language(s): afr deu eng fry gos hrx ltz nds nld pdc yid
  • target language(s): afr deu eng fry nds nld
  • valid target language labels: >>afr<< >>ang_Latn<< >>deu<< >>eng<< >>fry<< >>ltz<< >>nds<< >>nld<< >>sco<< >>yid<<
  • model: transformer (base)
  • data: opus (source)
  • tokenization: SentencePiece (spm32k,spm32k)
  • original model: opus-2021-02-23.zip
  • more information released models: OPUS-MT gmw-gmw README
  • more information about the model: MarianMT

This is a multilingual translation model with multiple target languages. A sentence initial language token is required in the form of >>id<< (id = valid target language ID), e.g. >>afr<<

Usage

A short example code:

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>nld<< You need help.",
    ">>afr<< I love your son."
]

model_name = "pytorch-models/opus-mt-tc-base-gmw-gmw"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print( tokenizer.decode(t, skip_special_tokens=True) )

# expected output:
#     Je hebt hulp nodig.
#     Ek is lief vir jou seun.

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-base-gmw-gmw")
print(pipe(>>nld<< You need help.))

# expected output: Je hebt hulp nodig.

Benchmarks

langpair testset chr-F BLEU #sent #words
afr-deu tatoeba-test-v2021-08-07 0.674 48.1 1583 9105
afr-eng tatoeba-test-v2021-08-07 0.728 58.8 1374 9622
afr-nld tatoeba-test-v2021-08-07 0.711 54.5 1056 6710
deu-afr tatoeba-test-v2021-08-07 0.696 52.4 1583 9507
deu-eng tatoeba-test-v2021-08-07 0.609 42.1 17565 149462
deu-nds tatoeba-test-v2021-08-07 0.442 18.6 9999 76137
deu-nld tatoeba-test-v2021-08-07 0.672 48.7 10218 75235
eng-afr tatoeba-test-v2021-08-07 0.735 56.5 1374 10317
eng-deu tatoeba-test-v2021-08-07 0.580 35.9 17565 151568
eng-nds tatoeba-test-v2021-08-07 0.412 16.6 2500 18264
eng-nld tatoeba-test-v2021-08-07 0.663 48.3 12696 91796
fry-eng tatoeba-test-v2021-08-07 0.500 32.5 220 1573
fry-nld tatoeba-test-v2021-08-07 0.633 43.1 260 1854
gos-nld tatoeba-test-v2021-08-07 0.405 15.6 1852 9903
hrx-deu tatoeba-test-v2021-08-07 0.484 24.7 471 2805
hrx-eng tatoeba-test-v2021-08-07 0.362 20.4 221 1235
ltz-deu tatoeba-test-v2021-08-07 0.556 37.2 347 2208
ltz-eng tatoeba-test-v2021-08-07 0.485 32.4 293 1840
ltz-nld tatoeba-test-v2021-08-07 0.534 39.3 292 1685
nds-deu tatoeba-test-v2021-08-07 0.572 34.5 9999 74564
nds-eng tatoeba-test-v2021-08-07 0.493 29.9 2500 17589
nds-nld tatoeba-test-v2021-08-07 0.621 42.3 1657 11490
nld-afr tatoeba-test-v2021-08-07 0.755 58.8 1056 6823
nld-deu tatoeba-test-v2021-08-07 0.686 50.4 10218 74131
nld-eng tatoeba-test-v2021-08-07 0.690 53.1 12696 89978
nld-fry tatoeba-test-v2021-08-07 0.478 25.1 260 1857
nld-nds tatoeba-test-v2021-08-07 0.462 21.4 1657 11711
afr-deu flores101-devtest 0.524 21.6 1012 25094
afr-eng flores101-devtest 0.693 46.8 1012 24721
afr-nld flores101-devtest 0.509 18.4 1012 25467
deu-afr flores101-devtest 0.534 21.4 1012 25740
deu-eng flores101-devtest 0.616 33.8 1012 24721
deu-nld flores101-devtest 0.516 19.2 1012 25467
eng-afr flores101-devtest 0.628 33.8 1012 25740
eng-deu flores101-devtest 0.581 29.1 1012 25094
eng-nld flores101-devtest 0.533 21.0 1012 25467
ltz-afr flores101-devtest 0.430 12.9 1012 25740
ltz-deu flores101-devtest 0.482 17.1 1012 25094
ltz-eng flores101-devtest 0.468 18.8 1012 24721
ltz-nld flores101-devtest 0.409 10.7 1012 25467
nld-afr flores101-devtest 0.494 16.8 1012 25740
nld-deu flores101-devtest 0.501 17.9 1012 25094
nld-eng flores101-devtest 0.551 25.6 1012 24721
deu-eng multi30k_test_2016_flickr 0.546 32.2 1000 12955
eng-deu multi30k_test_2016_flickr 0.582 28.8 1000 12106
deu-eng multi30k_test_2017_flickr 0.561 32.7 1000 11374
eng-deu multi30k_test_2017_flickr 0.573 27.6 1000 10755
deu-eng multi30k_test_2017_mscoco 0.499 25.5 461 5231
eng-deu multi30k_test_2017_mscoco 0.514 22.0 461 5158
deu-eng multi30k_test_2018_flickr 0.535 30.0 1071 14689
eng-deu multi30k_test_2018_flickr 0.547 25.3 1071 13703
deu-eng newssyscomb2009 0.527 25.4 502 11818
eng-deu newssyscomb2009 0.504 19.3 502 11271
deu-eng news-test2008 0.518 23.8 2051 49380
eng-deu news-test2008 0.492 19.3 2051 47447
deu-eng newstest2009 0.516 23.4 2525 65399
eng-deu newstest2009 0.498 18.8 2525 62816
deu-eng newstest2010 0.546 25.8 2489 61711
eng-deu newstest2010 0.508 20.7 2489 61503
deu-eng newstest2011 0.524 23.7 3003 74681
eng-deu newstest2011 0.493 19.2 3003 72981
deu-eng newstest2012 0.532 24.8 3003 72812
eng-deu newstest2012 0.493 19.5 3003 72886
deu-eng newstest2013 0.548 27.7 3000 64505
eng-deu newstest2013 0.517 22.5 3000 63737
deu-eng newstest2014-deen 0.548 27.3 3003 67337
eng-deu newstest2014-deen 0.532 22.0 3003 62688
deu-eng newstest2015-deen 0.553 28.6 2169 46443
eng-deu newstest2015-ende 0.544 25.7 2169 44260
deu-eng newstest2016-deen 0.596 33.3 2999 64119
eng-deu newstest2016-ende 0.580 30.0 2999 62669
deu-eng newstest2017-deen 0.561 29.5 3004 64399
eng-deu newstest2017-ende 0.535 24.1 3004 61287
deu-eng newstest2018-deen 0.610 36.1 2998 67012
eng-deu newstest2018-ende 0.613 35.4 2998 64276
deu-eng newstest2019-deen 0.582 32.3 2000 39227
eng-deu newstest2019-ende 0.583 31.2 1997 48746
deu-eng newstest2020-deen 0.604 32.0 785 38220
eng-deu newstest2020-ende 0.542 23.9 1418 52383
deu-eng newstestB2020-deen 0.598 31.2 785 37696
eng-deu newstestB2020-ende 0.532 23.3 1418 53092

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and the MeMAD project, funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.

Model conversion info

  • transformers version: 4.12.3
  • OPUS-MT git hash: e56a06b
  • port time: Sun Feb 13 14:42:10 EET 2022
  • port machine: LM0-400-22516.local
Downloads last month
16
Safetensors
Model size
62.3M params
Tensor type
FP16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Spaces using Helsinki-NLP/opus-mt-tc-base-gmw-gmw 7

Evaluation results