BounharAbdelaziz committed
Commit 1ca7c48
1 Parent(s): 3b3a6bb

Update README.md

Files changed (1)
  1. README.md +60 -55
README.md CHANGED
 
---
license: cc-by-nc-4.0
base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
metrics:
- bleu
datasets:
- atlasia/darija_english
model-index:
- name: Terjman-Large
  results: []
language:
- ar
- en
---

# Terjman-Large (240M params)

Our model is built upon the powerful Transformer architecture, leveraging state-of-the-art natural language processing techniques.
It is a fine-tuned version of [Helsinki-NLP/opus-mt-tc-big-en-ar](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-ar) on the [darija_english](https://huggingface.co/datasets/atlasia/darija_english) dataset, enhanced with curated corpora to ensure high-quality and accurate translations.

It achieves the following results on the evaluation set:
- Loss: 3.2078
- Bleu: 8.3292
- Gen Len: 34.4959

The fine-tuning was conducted on an **A100-40GB** GPU and took **23 hours**.
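
For reference, the BLEU figure above is a corpus-level score. Below is a minimal sketch of how such a score can be computed with the `evaluate` library and sacrebleu; the sentences shown are illustrative placeholders, not the actual evaluation data.

```python
import evaluate

# Minimal sketch: corpus-level BLEU with sacrebleu via the `evaluate` library.
# The prediction/reference pairs below are placeholders, not the card's evaluation set.
bleu = evaluate.load("sacrebleu")

predictions = ["مرحبا يا صاحبي كيفاش الحياة فالمغرب"]   # model outputs (placeholder)
references = [["مرحبا صاحبي كيف هي الحياة في المغرب"]]  # gold translations (placeholder)

result = bleu.compute(predictions=predictions, references=references)
print(result["score"])  # BLEU on the same 0-100 scale as the value reported above
```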
 
## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 40
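
For anyone who wants to approximate this setup with the Transformers `Seq2SeqTrainer`, the values above map directly onto `Seq2SeqTrainingArguments`. The sketch below is an assumption of what such a configuration could look like, not the exact training script behind this card; the output directory, batch size, and evaluation strategy are illustrative placeholders.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of a training configuration matching the hyperparameters listed above.
# Only learning_rate, warmup_ratio, and num_train_epochs come from the card;
# every other argument is an illustrative placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="terjman-large-finetuned",  # placeholder path
    learning_rate=3e-5,                    # from the card
    warmup_ratio=0.03,                     # from the card
    num_train_epochs=40,                   # from the card
    per_device_train_batch_size=8,         # placeholder, not documented above
    predict_with_generate=True,            # needed to report BLEU / Gen Len during evaluation
    evaluation_strategy="epoch",           # placeholder
)
```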
 
## Usage

Using our model for translation is simple and straightforward.
You can integrate it into your projects or workflows via the Hugging Face Transformers library.
Here's a basic example of how to use the model in Python:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large")

# Define the English text to translate into Moroccan Darija
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
```
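
If you prefer a one-liner, the same checkpoint should also work through the Transformers `pipeline` API. This is a minimal sketch using default generation settings, so outputs may differ slightly from the explicit `generate` call above.

```python
from transformers import pipeline

# Translation pipeline backed by the same checkpoint; generation settings are the defaults.
translator = pipeline("translation", model="atlasia/Terjman-Large")
print(translator("Hello my friend, how's life in Morocco")[0]["translation_text"])
```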

## Example

Let's see an example of translating English to Moroccan Darija:

**Input**: "Hello my friend, how's life in Morocco"

**Output**: "مرحبا يا صاحبي, كيفاش الحياة فالمغرب"

## Limitations

This version has some limitations mainly due to the tokenizer.
We're currently collecting more data with the aim of continuous improvement.
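
One quick way to see the tokenizer issue is to inspect how the inherited tokenizer segments Darija text; heavy fragmentation into subword pieces is a common symptom when a tokenizer was not trained on the target variety. The snippet below is only illustrative, and the exact splits depend on the checkpoint's vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large")

# Inspect how a Darija sentence is segmented into subword pieces.
print(tokenizer.tokenize("كيفاش الحياة فالمغرب"))
```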

## Feedback

We're continuously striving to improve our model's performance and usability, and we will keep improving it incrementally.
If you have any feedback, suggestions, or encounter any issues, please don't hesitate to reach out to us.

## Training results

| Training Loss | Epoch   | Step  | Validation Loss | Bleu   | Gen Len |
|:-------------:|:-------:|:-----:|:---------------:|:------:|:-------:|
| 3.2445        | 38.9994 | 15902 | 3.2079          | 8.3968 | 34.6722 |
| 3.2356        | 39.9264 | 16280 | 3.2078          | 8.3292 | 34.4959 |

### Framework versions

- Transformers 4.40.2
- Pytorch 2.2.1+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1