HuggingFaceGECLM/mix_tok_v2

Create README.md (commit a505bc0 by teven, almost 2 years ago)


	---
	language:
	- en
	---

	V2 of an English/code tokenizer. Byte-level BPE, 64k vocab, with digits split (the difference from v1). Trained on an equal mix of the sources below.

	On the NL side:
	- Books
	- C4
	- v1 of our CC (helen quality classifier)
	- enwiki
	- Gutenberg
	- Reddit

	On the code side:
	- Jupyter notebooks (weight 0.5, because the dataset was small)
	- GH issues
	- Stackexchange
	- The cleaned Python Stack

	This gives a total of roughly 1/3 code data (although Stackexchange and GH contain a lot of English).
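
The code share can be checked with a little arithmetic. The sketch below assumes that "equal mix" means a weight of 1.0 per source, with Jupyter notebooks at 0.5 as noted in the list. The README does not say whether weights apply to tokens, bytes, or documents, so the numbers are relative weights only, and the dictionary names are illustrative.

```python
# Relative sampling weights per source.
# Assumption: "equal mix" means weight 1.0 for every source,
# except Jupyter notebooks, which the README lists at 0.5.
nl_weights = {
    "Books": 1.0,
    "C4": 1.0,
    "CC v1": 1.0,
    "enwiki": 1.0,
    "Gutenberg": 1.0,
    "Reddit": 1.0,
}
code_weights = {
    "Jupyter notebooks": 0.5,
    "GH issues": 1.0,
    "Stackexchange": 1.0,
    "Python Stack": 1.0,
}


def code_fraction(nl, code):
    """Return the share of total weight that goes to code sources."""
    code_total = sum(code.values())
    return code_total / (sum(nl.values()) + code_total)


frac = code_fraction(nl_weights, code_weights)
print(f"code share: {frac:.3f}")  # 3.5 / 9.5, about 0.368
```

Under these assumptions the code side carries 3.5 of 9.5 total weight, slightly above one third, which is consistent with the figure quoted above.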