Update README.md
README.md
@@ -13,8 +13,8 @@ The tokenizer was trained on a comprehensive dataset, including:
 - English and Dutch Wikipedia (278M and 356M, respectively)
 - Dutch and English book datasets (211M and 355M, respectively)
 - Dutch news articles (256M)
-- CodeParrot GitHub code (158M)
-- CodeSearchNet
+- CodeParrot GitHub Python code (158M)
+- CodeSearchNet Python code (126M)
 - Markdown files with math markup (5.8M)
 - Arxiv scientific papers (169M)