Tokenizer?
#4
by
Reza2kn
- opened
Thanks for your work! I'm a bit confused, though: the tokenizer.model file that is uploaded looks like the same file as the Llama 2 tokenizer, NOT the one mentioned in the paper with +10,000 added Persian tokens. I've verified that the contents of the two files are identical (sketch of the check below). The other .json files related to the added/special tokens are also very short. I'm just looking for your new tokenizer and would appreciate any help!
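For reference, this is roughly how I compared the two files (a minimal sketch; the paths are placeholders for wherever the files were downloaded):

```python
# Compare the two tokenizer.model files byte-for-byte via SHA-256.
# Paths are placeholders; point them at your local downloads.
import hashlib

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

print(sha256_of("llama-2/tokenizer.model"))
print(sha256_of("this-repo/tokenizer.model"))
# Identical digests would mean the uploaded file is byte-for-byte
# the original Llama 2 tokenizer.
```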
Thanks!
Thanks for the kind words! I compared our tokenizer.model file with the actual tokenizer.model file of Llama 2, and they are indeed different: our tokenizer.model file contains 89,449 lines, while Llama 2's has 70,285.
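Since tokenizer.model is a binary SentencePiece file, a more direct comparison than line counts is to load both models and compare their vocabularies. A rough sketch (the paths are placeholders, not the actual repo layout):

```python
# Load both SentencePiece models and compare vocabulary sizes/contents.
# Paths are placeholders; point them at your local copies.
import sentencepiece as spm

ours = spm.SentencePieceProcessor(model_file="ours/tokenizer.model")
base = spm.SentencePieceProcessor(model_file="llama-2/tokenizer.model")

print("ours:", ours.get_piece_size())    # extended vocabulary size
print("llama2:", base.get_piece_size())  # Llama 2's base vocab is 32,000

# List the pieces present in our model but not in the base one,
# e.g. the added Persian tokens.
base_vocab = {base.id_to_piece(i) for i in range(base.get_piece_size())}
added = [ours.id_to_piece(i) for i in range(ours.get_piece_size())
         if ours.id_to_piece(i) not in base_vocab]
print(len(added), "tokens added on top of the base vocabulary")
```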