Romanian paraphrase

v1.0

A t5-small model fine-tuned for Romanian paraphrasing. Because no Romanian paraphrase dataset was available, I built my own, which contains ~60k examples.

How to use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("BlackKakapo/t5-small-paraphrase-ro")
model = AutoModelForSeq2SeqLM.from_pretrained("BlackKakapo/t5-small-paraphrase-ro")

Or

from transformers import T5ForConditionalGeneration, T5TokenizerFast 

model = T5ForConditionalGeneration.from_pretrained("BlackKakapo/t5-small-paraphrase-ro")
tokenizer = T5TokenizerFast.from_pretrained("BlackKakapo/t5-small-paraphrase-ro")

Generate

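import torch

text = "Am impresia că fac multe greșeli."

# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# A single sentence needs no padding.
encoding = tokenizer(text, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)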

beam_outputs = model.generate(
    input_ids=input_ids, 
    attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=10,
    top_p=0.9,
    early_stopping=False,
    num_return_sequences=5
)

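final_outputs = []

for beam_output in beam_outputs:
    text_para = tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)

    # Keep a candidate only if it differs from the input and is not already collected.
    if text.lower() != text_para.lower() and text_para not in final_outputs:
        final_outputs.append(text_para)
        break  # stop at the first valid paraphrase

print(final_outputs)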

Output

['Cred că fac multe greșeli.']
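
The Generate snippet stops at the first candidate that differs from the input. If you would rather keep every distinct paraphrase among the returned sequences, the filtering step could look roughly like the sketch below. The helper name filter_paraphrases is hypothetical and not part of this model or of transformers, and the candidate strings are illustrative stand-ins for decoded outputs, not actual model generations.

```python
def filter_paraphrases(source, candidates):
    """Return candidates that differ from `source`, dropping case-insensitive duplicates.

    Hypothetical helper: `candidates` is assumed to be a list of already
    decoded strings (e.g. the results of tokenizer.decode on each sequence).
    """
    seen = {source.strip().lower()}  # treat the input itself as already seen
    kept = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(candidate.strip())
    return kept


# Illustrative candidates, standing in for decoded model outputs.
candidates = [
    "Am impresia că fac multe greșeli.",
    "Cred că fac multe greșeli.",
    "cred că fac multe greșeli.",
]
print(filter_paraphrases(candidates[0], candidates))
```

Because generation uses sampling (do_sample=True), the candidates and therefore the filtered list will vary from run to run.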