atlasia/Terjman-Large

@@ -1,27 +1,32 @@
 ---
 license: cc-by-nc-4.0
 base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
-tags:
-- generated_from_trainer
 metrics:
 - bleu
 model-index:
 - name: Terjman-Large
   results: []
 ---
-# Terjman-Large
 Our model is built upon the powerful Transformer architecture, leveraging state-of-the-art natural language processing techniques.
-It has been finetuned on a the "atlasia/darija_english" dataset enhanced with curated corpora ensuring high-quality and accurate translations.
 It achieves the following results on the evaluation set:
 - Loss: 3.2078
 - Bleu: 8.3292
 - Gen Len: 34.4959
-### Training hyperparameters
 The following hyperparameters were used during training:
 - learning_rate: 3e-05
@@ -35,7 +40,55 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_ratio: 0.03
 - num_epochs: 40
-### Training results
 | Training Loss | Epoch   | Step  | Validation Loss | Bleu   | Gen Len |
 |:-------------:|:-------:|:-----:|:---------------:|:------:|:-------:|
@@ -80,57 +133,9 @@ The following hyperparameters were used during training:
 | 3.2445        | 38.9994 | 15902 | 3.2079          | 8.3968 | 34.6722 |
 | 3.2356        | 39.9264 | 16280 | 3.2078          | 8.3292 | 34.4959 |
 ### Framework versions
 - Transformers 4.40.2
 - Pytorch 2.2.1+cu121
 - Datasets 2.19.1
-- Tokenizers 0.19.1
-## Usage
-Using our model for translation is simple and straightforward.
-You can integrate it into your projects or workflows via the Hugging Face Transformers library.
-Here's a basic example of how to use the model in Python:
-```python
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
-# Load the tokenizer and model
-tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large")
-model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large")
-# Define your Moroccan Darija Arabizi text
-input_text = "Your english text goes here."
-# Tokenize the input text
-input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
-# Perform translation
-output_tokens = model.generate(**input_tokens)
-# Decode the output tokens
-output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
-print("Transliteration:", output_text)
-```
-## Example
-Let's see an example of transliterating Moroccan Darija Arabizi to Arabic:
-**Input**: "Hello my friend, how's life in Morocco"
-**Output**: "مرحبا يا صاحبي, كيفاش الحياة فالمغرب"
-## Limiations
-This version has some limitations mainly due to the Tokenizer.
-We're currently collecting more data with the aim of continous improvements.
-## Feedback
-We're continuously striving to improve our model's performance and usability and we will be improving it incrementaly.
-If you have any feedback, suggestions, or encounter any issues, please don't hesitate to reach out to us.

 ---
 license: cc-by-nc-4.0
 base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
 metrics:
 - bleu
+datasets:
+- atlasia/darija_english
 model-index:
 - name: Terjman-Large
   results: []
+language:
+- ar
+- en
 ---
+# Terjman-Large (240M params)
 Our model is built upon the powerful Transformer architecture, leveraging state-of-the-art natural language processing techniques.
+It is a fine-tuned version of [Helsinki-NLP/opus-mt-tc-big-en-ar](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-ar) on the [darija_english](https://huggingface.co/datasets/atlasia/darija_english) dataset, enhanced with curated corpora with the aim of producing high-quality, accurate translations.
 It achieves the following results on the evaluation set:
 - Loss: 3.2078
 - Bleu: 8.3292
 - Gen Len: 34.4959
+The fine-tuning was conducted on an **A100-40GB** GPU and took **23 hours**.
+## Training hyperparameters
 The following hyperparameters were used during training:
 - learning_rate: 3e-05
 - lr_scheduler_warmup_ratio: 0.03
 - num_epochs: 40
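To make the warmup setting concrete: with `lr_scheduler_warmup_ratio: 0.03`, the Transformers `Trainer` derives the number of warmup steps as the ceiling of the ratio times the total number of optimizer steps. The sketch below reproduces that arithmetic in plain Python. It assumes 16280 total steps (the last step in the training results table) and a linear decay after warmup. The scheduler type is not visible in this excerpt, so the decay shape and the helper name `lr_at_step` are illustrative assumptions, not the exact training setup.

```python
import math

LEARNING_RATE = 3e-05   # from the hyperparameter list
WARMUP_RATIO = 0.03     # from the hyperparameter list
TOTAL_STEPS = 16280     # assumption: final step in the training results table

# Warmup steps as the ceiling of ratio * total steps.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at_step(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay (assumed)."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - warmup_steps)


if __name__ == "__main__":
    print("warmup steps:", warmup_steps)
    for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, lr_at_step(step))
```

Under these assumptions the learning rate ramps up over roughly the first 3% of training and reaches its peak of 3e-05 at the end of warmup.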
+## Usage
+Using our model for translation is simple and straightforward.
+You can integrate it into your projects or workflows via the Hugging Face Transformers library.
+Here's a basic example of how to use the model in Python:
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+# Load the tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large")
+model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large")
+# Define your English text
+input_text = "Your English text goes here."
+# Tokenize the input text
+input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
+# Perform translation
+output_tokens = model.generate(**input_tokens)
+# Decode the output tokens
+output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
+print("Translation:", output_text)
+```
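For more than a handful of sentences, it is usually more efficient to translate in batches than to call the model once per sentence. The following is a minimal sketch of the batching logic only. `batched`, `translate_all`, and the `translate_batch` argument are hypothetical names, not part of the model or of Transformers. In practice, `translate_batch` would wrap the tokenizer call, `model.generate`, and `tokenizer.batch_decode` in the manner of the example above.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_all(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Translate `sentences` batch by batch, preserving input order."""
    outputs: List[str] = []
    for batch in batched(sentences, batch_size):
        outputs.extend(translate_batch(batch))
    return outputs


if __name__ == "__main__":
    # Stand-in translator for demonstration only; replace it with a function
    # that runs the tokenizer and model on the whole batch.
    def fake_translate(batch: List[str]) -> List[str]:
        return [text.upper() for text in batch]

    print(translate_all(["hello", "my friend", "how's life"], fake_translate, batch_size=2))
```

Passing the translation step in as a callable keeps the batching logic independent of the model, so it can be checked without downloading any weights.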
+## Example
+Let's see an example of translating English to Moroccan Darija:
+**Input**: "Hello my friend, how's life in Morocco"
+**Output**: "مرحبا يا صاحبي, كيفاش الحياة فالمغرب"
+## Limitations
+This version has some limitations, mainly due to the tokenizer.
+We're currently collecting more data with the aim of continuous improvement.
+## Feedback
+We're continuously striving to improve our model's performance and usability, and we will be improving it incrementally.
+If you have any feedback, suggestions, or encounter any issues, please don't hesitate to reach out to us.
+## Training results
 | Training Loss | Epoch   | Step  | Validation Loss | Bleu   | Gen Len |
 |:-------------:|:-------:|:-----:|:---------------:|:------:|:-------:|
 | 3.2445        | 38.9994 | 15902 | 3.2079          | 8.3968 | 34.6722 |
 | 3.2356        | 39.9264 | 16280 | 3.2078          | 8.3292 | 34.4959 |
 ### Framework versions
 - Transformers 4.40.2
 - Pytorch 2.2.1+cu121
 - Datasets 2.19.1
+- Tokenizers 0.19.1