Update README.md
README.md
@@ -13,8 +13,8 @@ The tokenizer was trained on a comprehensive dataset, including:
 - English and Dutch Wikipedia (278M and 356M, respectively)
 - Dutch and English book datasets (211M and 355M, respectively)
 - Dutch news articles (256M)
-- CodeParrot GitHub code (158M)
-- CodeSearchNet
+- CodeParrot GitHub Python code (158M)
+- CodeSearchNet Python code (126M)
 - Markdown files with math markup (5.8M)
 - Arxiv scientific papers (169M)