BeardedMonster committed
Commit
1a8950c
1 Parent(s): 8f0685d

update README.md

Files changed (1): README.md (+17, -18)
README.md CHANGED
@@ -13,7 +13,7 @@ Pretrained model on Nigerian languages including English using a causal language
 
  ### Model Description
 
- SabiYarn is a transformer model (adopted from nanogpt and inspired by GPT-J's architecture) pretrained on a large corpus of Nigerian language data in a self-supervised fashion. This means it was pretrained on the raw texts only,
+ SabiYarn-125M is the first of a series of transformer models (adapted from nanoGPT and inspired by GPT-J's architecture) pretrained on a large corpus of Nigerian language data in a self-supervised fashion. This means it was pretrained on the raw texts only,
  with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.
 
  Concretely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right.
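To make the training objective above concrete, here is a minimal sketch (illustrative values only, not code from this repository) of how the targets are obtained by shifting the inputs by one token:

```python
# Illustrative sketch of the causal-LM objective described above.
token_ids = [512, 891, 23, 77, 4096]   # hypothetical ids from any tokenizer

inputs  = token_ids[:-1]   # [512, 891, 23, 77]
targets = token_ids[1:]    # [891, 23, 77, 4096] -- same sequence, shifted by one

# At each position i the model predicts targets[i] from inputs[0..i];
# a causal mask hides every token to the right of position i.
```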
@@ -21,7 +21,7 @@ The model uses internally a mask-mechanism to make sure the predictions for the
  is not calculated across documents.
 
  This way, the model learns an inner representation of the languages that can then be used to extract features useful for downstream tasks. The model is best at what
- it was pretrained for however, which is generating texts.
+ it was pretrained for, however, which is generating coherent texts.
 
  This is the smallest version, with 125M parameters.
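The masking mentioned in the hunk context above (predictions not calculated across documents) can be pictured as a causal mask intersected with a same-document mask. A minimal sketch of the general technique, assuming packed sequences tagged with per-document ids; this is not SabiYarn's actual implementation, which the diff does not show:

```python
import torch

def causal_document_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where attention is allowed.

    doc_ids: (seq_len,) tensor; tokens from the same packed document
             share an id, e.g. [0, 0, 0, 1, 1].
    """
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Two packed documents of lengths 3 and 2: the first token of the second
# document cannot attend back into the first one.
mask = causal_document_mask(torch.tensor([0, 0, 0, 1, 1]))
```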
@@ -29,12 +29,12 @@ This is the smallest version, with 125M parameters.
 
  - **Funded by [optional]:** Personal
  - **Shared by [optional]:** Jeffreypaul
  - **Model type:** GPTJX (adapted from nanoGPT)
- - **Language(s) (NLP):** English, Yoruba, Hausa, Igbo, Pidgin. Yet to be tested for Efik, Urhobo.
+ - **Language(s) (NLP):** Mainly English, Yoruba, Hausa, Igbo, and Pidgin, with some coverage of Fulah/Fulfulde, Efik, and Urhobo.
 
 
  ### Model Sources [optional]
 
- - **Demo [optional]:** [More Information Needed]
+ - **Demo:**
 
  ## Uses
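The hunks below touch only the README's usage examples, so the lines defining `tokenizer`, `model`, and `generation_config` fall outside the diff context. For orientation, a hypothetical setup (the repo id and generation settings are assumptions, not taken from this commit) would look roughly like:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Assumed repo id, inferred from the committer and model names -- verify on the Hub.
repo_id = "BeardedMonster/SabiYarn-125M"

# trust_remote_code is assumed because GPTJX is a custom architecture.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Placeholder sampling settings; the README's actual config is not shown in this diff.
generation_config = GenerationConfig(do_sample=True, temperature=0.7, top_k=50)
```

Each test below then slices `output[0][input_len:]` so that only the newly generated tokens, not the prompt, are decoded.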
@@ -85,7 +85,7 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- ọ da tobọ dianẹ ayen rhọnvwe kerhọ-ọ. Ọtiọyena, e de ruiruo aghwoghwo ọkieje. (1 Kọr. 7:9; 1 Kọr. 12:2) Vwọrẹ uyota
+ """ọ da tobọ dianẹ ayen rhọnvwe kerhọ-ọ. Ọtiọyena, e de ruiruo aghwoghwo ọkieje. (1 Kọr. 7:9; 1 Kọr. 12:2) Vwọrẹ uyota"""
 
  # Test on Efik
  input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk", return_tensors="pt")["input_ids"]
@@ -94,7 +94,7 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- . Edi ediwak nditọ Israel ẹtịn̄ ẹnọ nnyịn mîkemeke ndinam n̄kpọ Abasi.|end_of_text|Ebe foto si, Getty Images Ebe foto si, Getty Images Nkọwa foto, Ndị
+ """. Edi ediwak nditọ Israel ẹtịn̄ ẹnọ nnyịn mîkemeke ndinam n̄kpọ Abasi.|end_of_text|Ebe foto si, Getty Images Ebe foto si, Getty Images Nkọwa foto, Ndị"""
 
  input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk mme Jew oro esịt okobụn̄ọde ke ntak idiọkido ke Israel, oro ẹkenyụn̄ ẹdude ke mfụhọ ke itie-ufụn mme nsunsu ido edinam Ido Ukpono Mme Jew eke akpa isua ikie.", return_tensors="pt")["input_ids"]
  output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
@@ -102,7 +102,7 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- Kûsịn idem nnyịme ndifiọk nditọete nnyịn inemesịt onyụn̄ anam nnyịn ikpọn̄utom nnyịn. (Matt. 26:31; Luke 22:42
+ """Kûsịn idem nnyịme ndifiọk nditọete nnyịn inemesịt onyụn̄ anam nnyịn ikpọn̄utom nnyịn. (Matt. 26:31; Luke 22:42"""
 
  # Test on English
  input_ids = tokenizer("How are you?", return_tensors="pt")["input_ids"]
@@ -111,7 +111,7 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- I'm doing alright, thanks for asking. How about you? I'm doing well too. Thanks for asking. So, what have you been up to lately? Not much, just hanging out with friends and family. You know how it is. Yeah,
+ """I'm doing alright, thanks for asking. How about you? I'm doing well too. Thanks for asking. So, what have you been up to lately? Not much, just hanging out with friends and family. You know how it is. Yeah,"""
 
  # Test on Yoruba
  input_ids = tokenizer("Awọn eeyan Cairo, ni Egypt ti bẹrẹ si n to lawọn ileesẹ to n ṣe burẹdi bayii.", return_tensors="pt")["input_ids"]
@@ -120,7 +120,7 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- |end_of_text|Ti o ba fẹ lati wa ni irú si rẹ awọn olumulo, o le se amọnà wọn taara sinu wa àwárí ojúewé. Yi ni asopọ, iwọ yoo wa wa julọ gbajumo reluwe ipa- -- https://www.saveatrain.com/rout
+ """|end_of_text|Ti o ba fẹ lati wa ni irú si rẹ awọn olumulo, o le se amọnà wọn taara sinu wa àwárí ojúewé. Yi ni asopọ, iwọ yoo wa wa julọ gbajumo reluwe ipa- -- https://www.saveatrain.com/rout"""
 
  # Test on Igbo
  input_ids = tokenizer("N'ala Igbo, ọtụtụ ndị mmadụ kwenyere na e nwere mmiri ara na elu-ilu", return_tensors="pt")["input_ids"]
@@ -129,10 +129,9 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- . Ọ bụ ezie na ọ bụ otu n'ime ihe ndị kasị dị ịrịba ama na nke kachasị ewu ewu na Africa, a na-elekarị ya anya dị ka otu n'ime ndị kasị baa ọgaranya n'ụwa.
+ """. Ọ bụ ezie na ọ bụ otu n'ime ihe ndị kasị dị ịrịba ama na nke kachasị ewu ewu na Africa, a na-elekarị ya anya dị ka otu n'ime ndị kasị baa ọgaranya n'ụwa.
  Nkọwapụta
- Ebe nrụọrụ weebụ na-ahụ maka gburugburu ebe
-
+ Ebe nrụọrụ weebụ na-ahụ maka gburugburu ebe"""
 
  # Test on FulFulde/Fulah
  input_ids = tokenizer("Jos un peeta gallure nɗer ɗi woyla caaka ɓanngeere lardu Naajeeriya. Gelle ɗen haa e ɗuuɗiri ɗun kamano", return_tensors="pt")["input_ids"]
@@ -141,7 +140,7 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- jogiiji maɓɓe nder lesdi Naajeeriya. |end_o|end_of_text|** Muhammadu_Buhari ** Muhammadu Buhari ko leydi e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukum
+ """jogiiji maɓɓe nder lesdi Naajeeriya. |end_o|end_of_text|** Muhammadu_Buhari ** Muhammadu Buhari ko leydi e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukum"""
 
  input_ids = tokenizer("Si hooreejo leydi on (himo wi’ee kadi persidan) accitii laamu, ko woote waɗetee, ɓurɗo jogaade yimɓe on halfinee laamu yeru happu.", return_tensors="pt")["input_ids"]
  output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
@@ -149,7 +148,7 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- |end_of_text|So en nganndii e hitaande 2010, o wiyi : “ko ñalawma hannde golle pulaar walla mbiyen jogiiɗo”. Eɗen mbaawi wiyde «u2008
+ """|end_of_text|So en nganndii e hitaande 2010, o wiyi : “ko ñalawma hannde golle pulaar walla mbiyen jogiiɗo”. Eɗen mbaawi wiyde «u2008"""
 
  # Test on Hausa
  input_ids = tokenizer("Ministan ya ƙara da cewa dole ne Mista Netanyahu ya sanya ranar da", return_tensors="pt")["input_ids"]
@@ -158,18 +157,18 @@ input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- za a rantsar da shi a matsayin shugaban ƙasar Isra'ila.|end_of_text|Home > Products > Kamarar Tsaro Ta Cctv (Lambobin 24 Kamarar Tsaro Ta Cctv)
+ """za a rantsar da shi a matsayin shugaban ƙasar Isra'ila.|end_of_text|Home > Products > Kamarar Tsaro Ta Cctv (Lambobin 24 Kamarar Tsaro Ta Cctv)
  Kamarar Tsaro Ta Cctv - ma'aikata, ma'aikata, mai sayarwa daga Sin
- Mu masu sana'a ne Kam
+ Mu masu sana'a ne Kam"""
 
  # Test on Pidgin
- input_ids = tokenizer("Di protesters wey dey wear black and red shirt tok say "enough be enough", return_tensors="pt")["input_ids"]
+ input_ids = tokenizer('Di protesters wey dey wear black and red shirt tok say "enough be enough"', return_tensors="pt")["input_ids"]
  output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
  input_len = len(input_ids[0])
  print(tokenizer.decode(output[0][input_len:]))
 
  # Output
- for di protest.|end_of_text|Wia dis foto come from, AFP/Getty Images Wetin we call dis foto, Some of di people wey dem arrest on top social media na one of di main reasons why some of di protesters enta street to protest against
+ """for di protest.|end_of_text|Wia dis foto come from, AFP/Getty Images Wetin we call dis foto, Some of di people wey dem arrest on top social media na one of di main reasons why some of di protesters enta street to protest against"""
 
  ```
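Several of the sampled outputs above run straight through the `|end_of_text|` separator and continue into unrelated web text. If that separator is a single token in the model's vocabulary (an assumption; the tokenizer's special tokens are not shown in this diff), generation can be cut off there by passing its id as `eos_token_id`:

```python
# Sketch: stop generation at the |end_of_text| separator, assuming it maps to
# a single vocabulary id (check tokenizer.get_vocab() to confirm).
eot_id = tokenizer.convert_tokens_to_ids("|end_of_text|")

output = model.generate(
    input_ids,
    generation_config=generation_config,
    max_new_tokens=50,
    eos_token_id=eot_id,  # halt as soon as the separator is produced
)
print(tokenizer.decode(output[0][input_len:]))
```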
 
 