mix_tok_v1 / README.md
language:
  - en

V1 of an English/code tokenizer. Byte-level BPE, 64k vocab. Trained on an equal mix of the sources below.

On the NL side:

  • Books
  • C4
  • v1 of our CC (helen quality classifier)
  • enwiki
  • Gutenberg
  • Reddit

On the code side:

  • Jupyter notebooks (0.5 weight, it was small)
  • GH issues
  • Stackexchange
  • The cleaned Python Stack

For a total of roughly 1/3 code data (although there is a lot of English in Stackexchange and GH issues).
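
As a sanity check on the code share, here is a minimal sketch. It assumes "equal mix" means one unit of sampling weight per source, with Jupyter notebooks at 0.5; that reading is an assumption, and the variable and function names are hypothetical rather than taken from the training setup.

```python
from fractions import Fraction

# Assumed sampling weights: 1 per source, 0.5 for Jupyter notebooks.
# These are an interpretation of "equal mix", not confirmed values.
NL_SOURCES = {
    "Books": Fraction(1),
    "C4": Fraction(1),
    "CC v1": Fraction(1),
    "enwiki": Fraction(1),
    "Gutenberg": Fraction(1),
    "Reddit": Fraction(1),
}
CODE_SOURCES = {
    "Jupyter notebooks": Fraction(1, 2),
    "GH issues": Fraction(1),
    "Stackexchange": Fraction(1),
    "Python Stack (cleaned)": Fraction(1),
}


def sampling_probabilities(*groups):
    """Normalize raw weights from all groups into per-source probabilities."""
    merged = {name: w for group in groups for name, w in group.items()}
    total = sum(merged.values())
    return {name: w / total for name, w in merged.items()}


def code_fraction(nl, code):
    """Share of the total sampling weight that goes to code-side sources."""
    code_total = sum(code.values())
    return code_total / (sum(nl.values()) + code_total)


probs = sampling_probabilities(NL_SOURCES, CODE_SOURCES)
frac = code_fraction(NL_SOURCES, CODE_SOURCES)
print(frac, round(float(frac), 3))  # 7/19 0.368
```

Under these assumed weights the code side gets 3.5 of 9.5 units, a little over one third. The effective share of actual code is lower, since Stackexchange and GH issues contain a lot of English.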