Terjman-Ultra (1.3B)

Our model is built on the Transformer architecture. It is a fine-tuned version of facebook/nllb-200-1.3B on the darija_english dataset, enhanced with curated corpora to improve translation quality.

It achieves the following results on the evaluation set:

Loss: 2.7070
Bleu: 4.6998
Gen Len: 35.6088

The fine-tuning was conducted on an A100-40GB GPU and took 32 hours.

Try it out on our dedicated Terjman-Ultra Space 🤗

Usage

Using our model for translation is straightforward. You can integrate it into your projects or workflows via the Hugging Face Transformers library. Here's a basic example of how to use the model in Python:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Ultra")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Ultra")

# Define the English text to translate into Moroccan Darija
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
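
To translate several sentences at once, the same calls can be wrapped in a small helper. The following is a minimal sketch, not part of the original card: the function name translate_batch and the batch_size and max_new_tokens defaults are assumptions chosen for illustration, and the tokenizer and model are expected to be loaded as in the example above.

```python
from typing import Any, List


def translate_batch(
    texts: List[str],
    tokenizer: Any,
    model: Any,
    batch_size: int = 8,        # assumption: tune to the available memory
    max_new_tokens: int = 128,  # assumption: generous cap on output length
) -> List[str]:
    """Translate a list of English sentences in fixed-size chunks."""
    translations: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Pad to the longest sentence in the chunk so it forms one tensor
        inputs = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        # Move the inputs to wherever the model lives (CPU or GPU)
        inputs = inputs.to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations


# Usage, assuming `tokenizer` and `model` were loaded as shown above:
# results = translate_batch(["Hello!", "How are you?"], tokenizer, model)
```

Chunking keeps memory use bounded for long input lists, and the translations come back in the same order as the inputs.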

Example

Let's see an example of translating English to Moroccan Darija:

Input: "Hi my friend, can you tell me a joke in moroccan darija? I'd be happy to hear that from you!"

Output: "أهلا صاحبي، تقدر تقولي مزحة بالدارجة المغربية؟ غادي نكون فرحان باش نسمعها منك!"

Limitations

This version has some limitations, mainly due to the tokenizer. We're currently collecting more data with the aim of continuous improvement.

Feedback

We're continuously striving to improve our model's performance and usability, and we will be improving it incrementally. If you have any feedback or suggestions, or if you encounter any issues, please don't hesitate to reach out to us.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 4
eval_batch_size: 4
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.03
num_epochs: 25
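
The relationship between these values can be checked with a little arithmetic. The sketch below assumes a single GPU (consistent with the one A100 mentioned above), takes the step counts from the training results table in this card, and assumes the warmup step count is rounded up. The derived figures are approximations, not values reported by the authors.

```python
import math

# Values from the hyperparameter list
train_batch_size = 4
gradient_accumulation_steps = 4
warmup_ratio = 0.03
num_devices = 1  # assumption: training ran on a single GPU

# Effective batch size per optimizer step
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print("total_train_batch_size:", total_train_batch_size)

# Step counts read from the training results table
steps_after_first_epoch = 2242
final_step = 56050

# Rough number of training examples seen per epoch
approx_train_examples = steps_after_first_epoch * total_train_batch_size
print("approx. training examples per epoch:", approx_train_examples)

# Approximate length of the linear warmup (assumes rounding up)
warmup_steps = math.ceil(final_step * warmup_ratio)
print("approx. warmup steps:", warmup_steps)
```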

Training results

Training Loss	Epoch	Step	Validation Loss	Bleu	Gen Len
3.203	0.9999	2242	2.9015	4.3057	36.7548
2.9175	1.9998	4484	2.7602	4.4286	35.708
2.8558	2.9997	6726	2.7303	4.629	35.562
2.8696	4.0	8969	2.7195	4.6537	35.562
2.8604	4.9999	11211	2.7144	4.6905	35.5702
2.8509	5.9998	13453	2.7112	4.599	35.5427
2.853	6.9997	15695	2.7098	4.6625	35.5317
2.8475	8.0	17938	2.7081	4.6901	35.6419
2.8192	8.9999	20180	2.7082	4.5474	35.6391
2.8395	9.9998	22422	2.7077	4.722	35.6088
2.8395	10.9997	24664	2.7076	4.752	35.5868
2.8362	12.0	26907	2.7074	4.6664	35.562
2.8673	12.9999	29149	2.7072	4.7004	35.6639
2.8465	13.9998	31391	2.7076	4.6715	35.5923
2.8281	14.9997	33633	2.7075	4.7045	35.5647
2.8191	16.0	35876	2.7068	4.7487	35.6253
2.874	16.9999	38118	2.7076	4.71	35.6006
2.8666	17.9998	40360	2.7069	4.6047	35.6281
2.8645	18.9997	42602	2.7063	4.6664	35.6088
2.8458	20.0	44845	2.7070	4.6552	35.5813
2.8501	20.9999	47087	2.7074	4.6919	35.5647
2.8309	21.9998	49329	2.7074	4.623	35.6226
2.854	22.9997	51571	2.7072	4.6495	35.5978
2.8407	24.0	53814	2.7070	4.6879	35.5482
2.8129	24.9972	56050	2.7070	4.6998	35.6088
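
To see where the run peaked, the table can be scanned programmatically. The snippet below is a small illustrative sketch: it copies the step, validation loss, and BLEU columns from the table above and looks up the best row for each metric. It makes no claim about which checkpoint the authors released.

```python
# (step, validation_loss, bleu) copied from the training results table
rows = [
    (2242, 2.9015, 4.3057), (4484, 2.7602, 4.4286), (6726, 2.7303, 4.629),
    (8969, 2.7195, 4.6537), (11211, 2.7144, 4.6905), (13453, 2.7112, 4.599),
    (15695, 2.7098, 4.6625), (17938, 2.7081, 4.6901), (20180, 2.7082, 4.5474),
    (22422, 2.7077, 4.722), (24664, 2.7076, 4.752), (26907, 2.7074, 4.6664),
    (29149, 2.7072, 4.7004), (31391, 2.7076, 4.6715), (33633, 2.7075, 4.7045),
    (35876, 2.7068, 4.7487), (38118, 2.7076, 4.71), (40360, 2.7069, 4.6047),
    (42602, 2.7063, 4.6664), (44845, 2.7070, 4.6552), (47087, 2.7074, 4.6919),
    (49329, 2.7074, 4.623), (51571, 2.7072, 4.6495), (53814, 2.7070, 4.6879),
    (56050, 2.7070, 4.6998),
]

best_bleu = max(rows, key=lambda r: r[2])    # row with the highest BLEU
lowest_loss = min(rows, key=lambda r: r[1])  # row with the lowest validation loss
final = rows[-1]                             # last evaluation of the run

print("best BLEU:", best_bleu)
print("lowest validation loss:", lowest_loss)
print("final evaluation:", final)
```

In this table the validation loss moves by only about 0.001 between step 17938 and the final step, so most of the improvement happens in the first few epochs.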

Framework versions

Transformers 4.40.2
Pytorch 2.2.1+cu121
Datasets 2.19.1
Tokenizers 0.19.1

