K024/mt5-zh-ja-en-trimmed

This model is fine-tuned from mt5-base.

The model vocabulary is trimmed to roughly one third of its original size by keeping the top 85,000 tokens found in the training data. The code to trim the vocabulary can be found here.
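The selection step can be sketched roughly as follows. This is not the author's trimming script: it assumes "top" means most frequent, and the names `select_top_tokens`, `build_id_remap`, and `always_keep` are hypothetical helpers introduced only for illustration. It works on plain lists of token ids, so it runs without downloading the model.

```python
from collections import Counter


def select_top_tokens(tokenized_corpus, keep, always_keep=()):
    """Return the sorted ids of the most frequent tokens.

    tokenized_corpus: iterable of lists of token ids.
    keep: target vocabulary size (85000 in the model card).
    always_keep: ids retained regardless of frequency, e.g. special
        tokens (an assumption; the original script may differ).
    """
    counts = Counter()
    for ids in tokenized_corpus:
        counts.update(ids)

    kept = set(always_keep)
    # Walk tokens from most to least frequent until the budget is used up.
    for token_id, _ in counts.most_common():
        if len(kept) >= keep:
            break
        kept.add(token_id)
    return sorted(kept)


def build_id_remap(kept_ids):
    """Map each retained old id to its new, contiguous id."""
    return {old: new for new, old in enumerate(kept_ids)}


# Toy corpus: token 5 appears three times, 7 twice, 9 and 3 once each.
corpus = [[5, 5, 7], [7, 5, 9], [3]]
kept_ids = select_top_tokens(corpus, keep=3, always_keep=(0, 1))
remap = build_id_remap(kept_ids)
```

In a real trimming run, the remap would then be used to slice the rows of the embedding and output matrices and to rebuild the tokenizer vocabulary; those steps depend on the model internals and are left out here.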

Usage:

from transformers import (
  T5Tokenizer,
  MT5ForConditionalGeneration,
  Text2TextGenerationPipeline,
)

# Load the trimmed model and its matching tokenizer from the Hub.
path = "K024/mt5-zh-ja-en-trimmed"
pipe = Text2TextGenerationPipeline(
  model=MT5ForConditionalGeneration.from_pretrained(path),
  tokenizer=T5Tokenizer.from_pretrained(path),
)

# The "ja2zh:" prefix selects the translation direction (Japanese to Chinese).
sentence = "ja2zh: 吾輩は猫である。名前はまだ無い。"
res = pipe(sentence, max_length=100, num_beams=4)
print(res[0]['generated_text'])
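The example above shows only the `ja2zh:` prefix. The sketch below assumes that the other directions follow the same `<source>2<target>: ` pattern across zh, ja, and en; that pattern is inferred from the single example and the training pairs, not confirmed by the model card. `build_prompt` and `SUPPORTED_LANGS` are hypothetical names.

```python
# Language codes assumed from the model name and training data list.
SUPPORTED_LANGS = ("zh", "ja", "en")


def build_prompt(src, tgt, text):
    """Prefix `text` with a direction tag such as "ja2zh: ".

    Assumes every direction uses the "<src>2<tgt>: " format seen in the
    model card's single example.
    """
    if src not in SUPPORTED_LANGS or tgt not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language pair: {src} -> {tgt}")
    if src == tgt:
        raise ValueError("source and target languages must differ")
    return f"{src}2{tgt}: {text}"


prompt = build_prompt("ja", "zh", "吾輩は猫である。名前はまだ無い。")
```

The resulting string can be passed to `pipe(...)` in place of `sentence`.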

Training data:

wikimedia-en-ja
wikimedia-en-zh
wikimedia-ja-zh
wikititles-ja-en
wikititles-zh-en
wikimatrix-ja-zh
news-commentary-en-ja
news-commentary-en-zh
news-commentary-ja-zh
ted2020-en-ja
ted2020-en-zh
ted2020-ja-zh

License:
