HuggingFaceGECLM/mix_tok_v1


language:
  - en

V1 of an English/code tokenizer. Byte-level BPE, 64k vocab. Trained on an equal mix of the sources below.

On the NL side:

- Books
- C4
- v1 of our CC (helen quality classifier)
- enwiki
- Gutenberg
- Reddit

On the code side:

- Jupyter notebooks (0.5 weight, since the dataset was small)
- GH issues
- Stackexchange
- The cleaned Python Stack

For a total of roughly 1/3 code data (although Stackexchange and GH issues also contain a lot of English).
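
As a sanity check on the "1/3 code" figure, here is a minimal sketch of the arithmetic. It assumes "equal mix" means one unit of weight per source, with Jupyter notebooks at 0.5 as stated; the card does not say whether weights are counted per source, per document, or per token, so the dictionaries and the `code_share` helper below are illustrative only.

```python
from fractions import Fraction

# Assumed relative weights: 1 per source (our reading of "equal mix").
nl_weights = {
    "Books": Fraction(1),
    "C4": Fraction(1),
    "CC v1": Fraction(1),
    "enwiki": Fraction(1),
    "Gutenberg": Fraction(1),
    "Reddit": Fraction(1),
}

# Jupyter notebooks are down-weighted to 0.5, as the card states.
code_weights = {
    "Jupyter notebooks": Fraction(1, 2),
    "GH issues": Fraction(1),
    "Stackexchange": Fraction(1),
    "Python Stack": Fraction(1),
}


def code_share(nl, code):
    """Return the fraction of total mix weight that falls on the code side."""
    code_total = sum(code.values())
    return code_total / (sum(nl.values()) + code_total)


share = code_share(nl_weights, code_weights)
print(share)  # exact fraction of the mix on the code side
```

Under these assumptions the code side carries 3.5 of 9.5 weight units, i.e. 7/19 or about 37%, which is in line with the rough "1/3" quoted above.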