BeardedMonster committed
Commit 1a8950c · Parent(s): 8f0685d
update README.md

README.md CHANGED
@@ -13,7 +13,7 @@ Pretrained model on Nigerian languages including English using a causal language
 
 ### Model Description
 
-SabiYarn is a transformer
+SabiYarn-125M is the first of a series of transformer models (adopted from nanoGPT and inspired by GPT-J's architecture) pretrained on a large corpus of Nigerian-language data in a self-supervised fashion. This means it was pretrained on the raw texts only,
 with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.
 
 More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right.
@@ -21,7 +21,7 @@ The model uses internally a mask-mechanism to make sure the predictions for the
 is not calculated across documents.
 
 This way, the model learns an inner representation of the languages that can then be used to extract features useful for downstream tasks. The model is best at what
-it was pretrained for however, which is generating texts.
+it was pretrained for, however, which is generating coherent texts.
 
 This is the smallest version, with 125M parameters.
 
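The next-word objective described above comes down to shifting the target sequence one token to the right and masking future positions. A minimal Python sketch of how inputs and labels line up (illustrative only; names such as `token_ids` are not from the README):

```python
# Illustrative sketch of the causal-LM objective described above (not from the README).
# For a token sequence, the input at position i is token i and the label is token i+1,
# so the model learns to guess the next word at every position.
token_ids = [101, 7, 42, 13, 9]   # hypothetical token ids for one text sequence

inputs = token_ids[:-1]   # [101, 7, 42, 13]
labels = token_ids[1:]    # [7, 42, 13, 9]  -- same sequence shifted one token right

# The causal attention mask ensures the prediction for position i only uses
# tokens 1..i, and when documents are packed together the loss is typically
# not computed across document boundaries (e.g., labels after an
# end-of-text token are masked out).
```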
@@ -29,12 +29,12 @@ This is the smallest version, with 125M parameters.
 - **Funded by [optional]:** Personal
 - **Shared by [optional]:** Jeffreypaul
 - **Model type:** GPTJX (Adopted from NanoGPT)
-- **Language(s) (NLP):** English, Yoruba, Hausa, Igbo, Pidgin
+- **Language(s) (NLP):** Mainly English, Yoruba, Hausa, Igbo and Pidgin, plus some others: Fulah/Fulfulde, Efik, Urhobo.
 
 
 ### Model Sources [optional]
 
-- **Demo
+- **Demo:**
 
 ## Uses
 
@@ -85,7 +85,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-ọ da tobọ dianẹ ayen rhọnvwe kerhọ-ọ. Ọtiọyena, e de ruiruo aghwoghwo ọkieje. (1 Kọr. 7:9; 1 Kọr. 12:2) Vwọrẹ uyota
+""" ọ da tobọ dianẹ ayen rhọnvwe kerhọ-ọ. Ọtiọyena, e de ruiruo aghwoghwo ọkieje. (1 Kọr. 7:9; 1 Kọr. 12:2) Vwọrẹ uyota"""
 
 #Test on Efik
 input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk", return_tensors="pt")["input_ids"]
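Every test in this README follows the same tokenize, generate, decode pattern, slicing off the first `input_len` tokens so that only the newly generated continuation is printed. A small helper makes that pattern explicit; this is a sketch that assumes `model`, `tokenizer`, and `generation_config` have already been created as in the earlier, unchanged part of the README:

```python
def generate_text(prompt, max_new_tokens=50):
    # Sketch of the pattern repeated in the tests below; assumes `model`,
    # `tokenizer` and `generation_config` are already defined as earlier in the README.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output = model.generate(input_ids, generation_config=generation_config,
                            max_new_tokens=max_new_tokens)
    input_len = len(input_ids[0])
    # Drop the prompt tokens so only the generated continuation is returned.
    return tokenizer.decode(output[0][input_len:])

# Example: print(generate_text("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk"))
```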
@@ -94,7 +94,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-. Edi ediwak nditọ Israel ẹtịn̄ ẹnọ nnyịn mîkemeke ndinam n̄kpọ Abasi.|end_of_text|Ebe foto si, Getty Images Ebe foto si, Getty Images Nkọwa foto, Ndị
+""". Edi ediwak nditọ Israel ẹtịn̄ ẹnọ nnyịn mîkemeke ndinam n̄kpọ Abasi.|end_of_text|Ebe foto si, Getty Images Ebe foto si, Getty Images Nkọwa foto, Ndị"""
 
 input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk mme Jew oro esịt okobụn̄ọde ke ntak idiọkido ke Israel, oro ẹkenyụn̄ ẹdude ke mfụhọ ke itie-ufụn mme nsunsu ido edinam Ido Ukpono Mme Jew eke akpa isua ikie.", return_tensors="pt")["input_ids"]
 output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
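Several of the sample outputs run past an `|end_of_text|` marker and continue into unrelated text. If that marker corresponds to a single token in this tokenizer (an assumption, not stated in the diff), generation can be told to stop there via the eos token id, for example:

```python
from transformers import GenerationConfig

# Assumption: "|end_of_text|" is a single special token in this tokenizer.
eos_id = tokenizer.convert_tokens_to_ids("|end_of_text|")

generation_config = GenerationConfig(
    max_new_tokens=50,
    eos_token_id=eos_id,  # stop as soon as the end-of-text token is produced
)
```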
@@ -102,7 +102,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-Kûsịn idem nnyịme ndifiọk nditọete nnyịn inemesịt onyụn̄ anam nnyịn ikpọn̄utom nnyịn. (Matt. 26:31; Luke 22:42
+"""Kûsịn idem nnyịme ndifiọk nditọete nnyịn inemesịt onyụn̄ anam nnyịn ikpọn̄utom nnyịn. (Matt. 26:31; Luke 22:42"""
 
 # Test on English
 input_ids = tokenizer("How are you?", return_tensors="pt")["input_ids"]
@@ -111,7 +111,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-I'm doing alright, thanks for asking. How about you? I'm doing well too. Thanks for asking. So, what have you been up to lately? Not much, just hanging out with friends and family. You know how it is. Yeah,
+"""I'm doing alright, thanks for asking. How about you? I'm doing well too. Thanks for asking. So, what have you been up to lately? Not much, just hanging out with friends and family. You know how it is. Yeah,"""
 
 # Test on Yoruba
 input_ids = tokenizer("Awọn eeyan Cairo, ni Egypt ti bẹrẹ si n to lawọn ileesẹ to n ṣe burẹdi bayii.", return_tensors="pt")["input_ids"]
@@ -120,7 +120,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-|end_of_text|Ti o ba fẹ lati wa ni irú si rẹ awọn olumulo, o le se amọnà wọn taara sinu wa àwárí ojúewé. Yi ni asopọ, iwọ yoo wa wa julọ gbajumo reluwe ipa- -- https://www.saveatrain.com/rout
+"""|end_of_text|Ti o ba fẹ lati wa ni irú si rẹ awọn olumulo, o le se amọnà wọn taara sinu wa àwárí ojúewé. Yi ni asopọ, iwọ yoo wa wa julọ gbajumo reluwe ipa- -- https://www.saveatrain.com/rout"""
 
 # Test on Igbo
 input_ids = tokenizer("N'ala Igbo, ọtụtụ ndị mmadụ kwenyere na e nwere mmiri ara na elu-ilu", return_tensors="pt")["input_ids"]
@@ -129,10 +129,9 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-. Ọ bụ ezie na ọ bụ otu n'ime ihe ndị kasị dị ịrịba ama na nke kachasị ewu ewu na Africa, a na-elekarị ya anya dị ka otu n'ime ndị kasị baa ọgaranya n'ụwa.
+""". Ọ bụ ezie na ọ bụ otu n'ime ihe ndị kasị dị ịrịba ama na nke kachasị ewu ewu na Africa, a na-elekarị ya anya dị ka otu n'ime ndị kasị baa ọgaranya n'ụwa.
 Nkọwapụta
-Ebe nrụọrụ weebụ na-ahụ maka gburugburu ebe
-
+Ebe nrụọrụ weebụ na-ahụ maka gburugburu ebe"""
 
 # Test on FulFulde/Fulah
 input_ids = tokenizer("Jos un peeta gallure nɗer ɗi woyla caaka ɓanngeere lardu Naajeeriya. Gelle ɗen haa e ɗuuɗiri ɗun kamano", return_tensors="pt")["input_ids"]
@@ -141,7 +140,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-jogiiji maɓɓe nder lesdi Naajeeriya. |end_o|end_of_text|** Muhammadu_Buhari ** Muhammadu Buhari ko leydi e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukum
+"""jogiiji maɓɓe nder lesdi Naajeeriya. |end_o|end_of_text|** Muhammadu_Buhari ** Muhammadu Buhari ko leydi e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukum"""
 
 input_ids = tokenizer("Si hooreejo leydi on (himo wi’ee kadi persidan) accitii laamu, ko woote waɗetee, ɓurɗo jogaade yimɓe on halfinee laamu yeru happu.", return_tensors="pt")["input_ids"]
 output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
@@ -149,7 +148,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-|end_of_text|So en nganndii e hitaande 2010, o wiyi : “ko ñalawma hannde golle pulaar walla mbiyen jogiiɗo”. Eɗen mbaawi wiyde «u2008
+"""|end_of_text|So en nganndii e hitaande 2010, o wiyi : “ko ñalawma hannde golle pulaar walla mbiyen jogiiɗo”. Eɗen mbaawi wiyde «u2008"""
 
 # Test on Hausa
 input_ids = tokenizer("Ministan ya ƙara da cewa dole ne Mista Netanyahu ya sanya ranar da", return_tensors="pt")["input_ids"]
@@ -158,18 +157,18 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-za a rantsar da shi a matsayin shugaban ƙasar Isra'ila.|end_of_text|Home > Products > Kamarar Tsaro Ta Cctv (Lambobin 24 Kamarar Tsaro Ta Cctv)
+"""za a rantsar da shi a matsayin shugaban ƙasar Isra'ila.|end_of_text|Home > Products > Kamarar Tsaro Ta Cctv (Lambobin 24 Kamarar Tsaro Ta Cctv)
 Kamarar Tsaro Ta Cctv - ma'aikata, ma'aikata, mai sayarwa daga Sin
-Mu masu sana'a ne Kam
+Mu masu sana'a ne Kam"""
 
 # Test on Pidgin
-input_ids = tokenizer(
+input_ids = tokenizer('Di protesters wey dey wear black and red shirt tok say "enough be enough"', return_tensors="pt")["input_ids"]
 output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
 input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-for di protest.|end_of_text|Wia dis foto come from, AFP/Getty Images Wetin we call dis foto, Some of di people wey dem arrest on top social media na one of di main reasons why some of di protesters enta street to protest against
+"""for di protest.|end_of_text|Wia dis foto come from, AFP/Getty Images Wetin we call dis foto, Some of di people wey dem arrest on top social media na one of di main reasons why some of di protesters enta street to protest against"""
 
 ```
 
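For completeness, here is an end-to-end sketch of how such a checkpoint is typically loaded and queried with the `transformers` Auto classes. The repository id `BeardedMonster/SabiYarn-125M` and the need for `trust_remote_code=True` (GPTJX is a custom architecture) are assumptions, not taken from this diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

repo_id = "BeardedMonster/SabiYarn-125M"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

generation_config = GenerationConfig(max_new_tokens=50)

prompt = "How are you?"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
output = model.generate(input_ids, generation_config=generation_config)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))  # print only the generated continuation
```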