BeardedMonster committed
Commit 1a8950c · Parent(s): 8f0685d
update README.md

README.md CHANGED
@@ -13,7 +13,7 @@ Pretrained model on Nigerian languages including English using a causal language
 
 ### Model Description
 
-SabiYarn is a transformer
+SabiYarn-125M is the first of a series of transformer models (adopted from nanoGPT and inspired by GPT-J's architecture) pretrained on a large corpus of Nigerian-language data in a self-supervised fashion. This means it was pretrained on the raw texts only,
 with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.
 
 More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right.
@@ -21,7 +21,7 @@ The model uses internally a mask-mechanism to make sure the predictions for the
 is not calculated across documents.
 
 This way, the model learns an inner representation of the languages that can then be used to extract features useful for downstream tasks. The model is best at what
-it was pretrained for however, which is generating texts.
+it was pretrained for, however, which is generating coherent texts.
 
 This is the smallest version, with 125M parameters.
 
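The next-word objective described above comes down to shifting the target sequence one token to the right and masking future positions. A minimal Python sketch of how inputs and labels line up (illustrative only; names such as `token_ids` are not from the README):

```python
# Illustrative sketch of the causal-LM objective described above (not from the README).
# For a token sequence, the input at position i is token i and the label is token i+1,
# so the model learns to guess the next word at every position.
token_ids = [101, 7, 42, 13, 9]   # hypothetical token ids for one text sequence

inputs = token_ids[:-1]   # [101, 7, 42, 13]
labels = token_ids[1:]    # [7, 42, 13, 9]  -- same sequence shifted one token right

# The causal attention mask ensures the prediction for position i only uses
# tokens 1..i, and when documents are packed together the loss is typically
# not computed across document boundaries (e.g., labels after an
# end-of-text token are masked out).
```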
@@ -29,12 +29,12 @@ This is the smallest version, with 125M parameters.
 - **Funded by [optional]:** Personal
 - **Shared by [optional]:** Jeffreypaul
 - **Model type:** GPTJX (Adopted from NanoGPT)
-- **Language(s) (NLP):** English, Yoruba, Hausa, Igbo, Pidgin
+- **Language(s) (NLP):** Mainly English, Yoruba, Hausa, Igbo and Pidgin, plus some others: Fulah/Fulfulde, Efik, Urhobo.
 
 
 ### Model Sources [optional]
 
-- **Demo
+- **Demo:**
 
 ## Uses
 
@@ -85,7 +85,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-ọ da tobọ dianẹ ayen rhọnvwe kerhọ-ọ. Ọtiọyena, e de ruiruo aghwoghwo ọkieje. (1 Kọr. 7:9; 1 Kọr. 12:2) Vwọrẹ uyota
+""" ọ da tobọ dianẹ ayen rhọnvwe kerhọ-ọ. Ọtiọyena, e de ruiruo aghwoghwo ọkieje. (1 Kọr. 7:9; 1 Kọr. 12:2) Vwọrẹ uyota"""
 
 #Test on Efik
 input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk", return_tensors="pt")["input_ids"]
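Every test in this README follows the same tokenize, generate, decode pattern, slicing off the first `input_len` tokens so that only the newly generated continuation is printed. A small helper makes that pattern explicit; this is a sketch that assumes `model`, `tokenizer`, and `generation_config` have already been created as in the earlier, unchanged part of the README:

```python
def generate_text(prompt, max_new_tokens=50):
    # Sketch of the pattern repeated in the tests below; assumes `model`,
    # `tokenizer` and `generation_config` are already defined as earlier in the README.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output = model.generate(input_ids, generation_config=generation_config,
                            max_new_tokens=max_new_tokens)
    input_len = len(input_ids[0])
    # Drop the prompt tokens so only the generated continuation is returned.
    return tokenizer.decode(output[0][input_len:])

# Example: print(generate_text("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk"))
```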
@@ -94,7 +94,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-. Edi ediwak nditọ Israel ẹtịn̄ ẹnọ nnyịn mîkemeke ndinam n̄kpọ Abasi.|end_of_text|Ebe foto si, Getty Images Ebe foto si, Getty Images Nkọwa foto, Ndị
+""". Edi ediwak nditọ Israel ẹtịn̄ ẹnọ nnyịn mîkemeke ndinam n̄kpọ Abasi.|end_of_text|Ebe foto si, Getty Images Ebe foto si, Getty Images Nkọwa foto, Ndị"""
 
 input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk mme Jew oro esịt okobụn̄ọde ke ntak idiọkido ke Israel, oro ẹkenyụn̄ ẹdude ke mfụhọ ke itie-ufụn mme nsunsu ido edinam Ido Ukpono Mme Jew eke akpa isua ikie.", return_tensors="pt")["input_ids"]
 output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
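Several of the sample outputs run past an `|end_of_text|` marker and continue into unrelated text. If that marker corresponds to a single token in this tokenizer (an assumption, not stated in the diff), generation can be told to stop there via the eos token id, for example:

```python
from transformers import GenerationConfig

# Assumption: "|end_of_text|" is a single special token in this tokenizer.
eos_id = tokenizer.convert_tokens_to_ids("|end_of_text|")

generation_config = GenerationConfig(
    max_new_tokens=50,
    eos_token_id=eos_id,  # stop as soon as the end-of-text token is produced
)
```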
@@ -102,7 +102,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-Kûsịn idem nnyịme ndifiọk nditọete nnyịn inemesịt onyụn̄ anam nnyịn ikpọn̄utom nnyịn. (Matt. 26:31; Luke 22:42
+"""Kûsịn idem nnyịme ndifiọk nditọete nnyịn inemesịt onyụn̄ anam nnyịn ikpọn̄utom nnyịn. (Matt. 26:31; Luke 22:42"""
 
 # Test on English
 input_ids = tokenizer("How are you?", return_tensors="pt")["input_ids"]
@@ -111,7 +111,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-I'm doing alright, thanks for asking. How about you? I'm doing well too. Thanks for asking. So, what have you been up to lately? Not much, just hanging out with friends and family. You know how it is. Yeah,
+"""I'm doing alright, thanks for asking. How about you? I'm doing well too. Thanks for asking. So, what have you been up to lately? Not much, just hanging out with friends and family. You know how it is. Yeah,"""
 
 # Test on Yoruba
 input_ids = tokenizer("Awọn eeyan Cairo, ni Egypt ti bẹrẹ si n to lawọn ileesẹ to n ṣe burẹdi bayii.", return_tensors="pt")["input_ids"]
@@ -120,7 +120,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-|end_of_text|Ti o ba fẹ lati wa ni irú si rẹ awọn olumulo, o le se amọnà wọn taara sinu wa àwárí ojúewé. Yi ni asopọ, iwọ yoo wa wa julọ gbajumo reluwe ipa- -- https://www.saveatrain.com/rout
+"""|end_of_text|Ti o ba fẹ lati wa ni irú si rẹ awọn olumulo, o le se amọnà wọn taara sinu wa àwárí ojúewé. Yi ni asopọ, iwọ yoo wa wa julọ gbajumo reluwe ipa- -- https://www.saveatrain.com/rout"""
 
 # Test on Igbo
 input_ids = tokenizer("N'ala Igbo, ọtụtụ ndị mmadụ kwenyere na e nwere mmiri ara na elu-ilu", return_tensors="pt")["input_ids"]
@@ -129,10 +129,9 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-. Ọ bụ ezie na ọ bụ otu n'ime ihe ndị kasị dị ịrịba ama na nke kachasị ewu ewu na Africa, a na-elekarị ya anya dị ka otu n'ime ndị kasị baa ọgaranya n'ụwa.
+""". Ọ bụ ezie na ọ bụ otu n'ime ihe ndị kasị dị ịrịba ama na nke kachasị ewu ewu na Africa, a na-elekarị ya anya dị ka otu n'ime ndị kasị baa ọgaranya n'ụwa.
 Nkọwapụta
-Ebe nrụọrụ weebụ na-ahụ maka gburugburu ebe
-
+Ebe nrụọrụ weebụ na-ahụ maka gburugburu ebe"""
 
 # Test on FulFulde/Fulah
 input_ids = tokenizer("Jos un peeta gallure nɗer ɗi woyla caaka ɓanngeere lardu Naajeeriya. Gelle ɗen haa e ɗuuɗiri ɗun kamano", return_tensors="pt")["input_ids"]
@@ -141,7 +140,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-jogiiji maɓɓe nder lesdi Naajeeriya. |end_o|end_of_text|** Muhammadu_Buhari ** Muhammadu Buhari ko leydi e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukum
+"""jogiiji maɓɓe nder lesdi Naajeeriya. |end_o|end_of_text|** Muhammadu_Buhari ** Muhammadu Buhari ko leydi e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukum"""
 
 input_ids = tokenizer("Si hooreejo leydi on (himo wi’ee kadi persidan) accitii laamu, ko woote waɗetee, ɓurɗo jogaade yimɓe on halfinee laamu yeru happu.", return_tensors="pt")["input_ids"]
 output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
@@ -149,7 +148,7 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-|end_of_text|So en nganndii e hitaande 2010, o wiyi : “ko ñalawma hannde golle pulaar walla mbiyen jogiiɗo”. Eɗen mbaawi wiyde «u2008
+"""|end_of_text|So en nganndii e hitaande 2010, o wiyi : “ko ñalawma hannde golle pulaar walla mbiyen jogiiɗo”. Eɗen mbaawi wiyde «u2008"""
 
 # Test on Hausa
 input_ids = tokenizer("Ministan ya ƙara da cewa dole ne Mista Netanyahu ya sanya ranar da", return_tensors="pt")["input_ids"]
@@ -158,18 +157,18 @@ input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-za a rantsar da shi a matsayin shugaban ƙasar Isra'ila.|end_of_text|Home > Products > Kamarar Tsaro Ta Cctv (Lambobin 24 Kamarar Tsaro Ta Cctv)
+"""za a rantsar da shi a matsayin shugaban ƙasar Isra'ila.|end_of_text|Home > Products > Kamarar Tsaro Ta Cctv (Lambobin 24 Kamarar Tsaro Ta Cctv)
 Kamarar Tsaro Ta Cctv - ma'aikata, ma'aikata, mai sayarwa daga Sin
-Mu masu sana'a ne Kam
+Mu masu sana'a ne Kam"""
 
 # Test on Pidgin
-input_ids = tokenizer(
+input_ids = tokenizer('Di protesters wey dey wear black and red shirt tok say "enough be enough"', return_tensors="pt")["input_ids"]
 output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
 input_len = len(input_ids[0])
 print(tokenizer.decode(output[0][input_len:]))
 
 #Output
-for di protest.|end_of_text|Wia dis foto come from, AFP/Getty Images Wetin we call dis foto, Some of di people wey dem arrest on top social media na one of di main reasons why some of di protesters enta street to protest against
+"""for di protest.|end_of_text|Wia dis foto come from, AFP/Getty Images Wetin we call dis foto, Some of di people wey dem arrest on top social media na one of di main reasons why some of di protesters enta street to protest against"""
 
 ```
 
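For completeness, here is an end-to-end sketch of how such a checkpoint is typically loaded and queried with the `transformers` Auto classes. The repository id `BeardedMonster/SabiYarn-125M` and the need for `trust_remote_code=True` (GPTJX is a custom architecture) are assumptions, not taken from this diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

repo_id = "BeardedMonster/SabiYarn-125M"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

generation_config = GenerationConfig(max_new_tokens=50)

prompt = "How are you?"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
output = model.generate(input_ids, generation_config=generation_config)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))  # print only the generated continuation
```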