Tokenizer expansion

#2 · opened by RASMUS

After first reading your paper, I thought you had used tokenizer expansion. Now that we are pretraining LLMs again and experimenting with continued pretraining, I reread the paper and saw that you did not do tokenizer expansion. Have you tried it since, or done any further experimentation?

SpeakLeash | Spichlerz org

After observing the challenges with tokenizer expansion, we decided to abandon that direction. Instead, we conducted experiments where we replaced the tokenizer with one we trained ourselves. Soon, we will release version v3 of the model, which incorporates our tokenizer. Replacing the tokenizer is a challenging process that requires addressing several issues along the way (e.g., embedding initialization and a soft training start to prevent the model from degrading early on). However, the result is a model that operates with our tokenizer, which is the most desirable option.
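For anyone wanting to experiment along these lines, here is a minimal sketch (PyTorch/Transformers) of one common approach to tokenizer replacement: initialize each new token's embedding as the mean of the old-tokenizer embeddings of its surface string, then use a soft start that trains only the embeddings before full continued pretraining. The checkpoint and tokenizer names are placeholders, and this illustrates the general technique rather than the exact SpeakLeash recipe.

```python
# Sketch: tokenizer replacement with averaged-embedding initialization
# and a "soft start" phase. Names and choices below are illustrative
# assumptions, not the authors' actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "old-base-model"        # hypothetical base checkpoint
new_tokenizer_name = "our-new-tokenizer"  # hypothetical replacement tokenizer

model = AutoModelForCausalLM.from_pretrained(base_model_name)
old_tok = AutoTokenizer.from_pretrained(base_model_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)

old_emb = model.get_input_embeddings().weight.detach()  # (old_vocab, hidden)
new_vocab = len(new_tok)

# Initialize every new token as the mean of the old embeddings of its
# surface string; tokens that map to nothing fall back to the global mean.
new_emb = old_emb.mean(dim=0).repeat(new_vocab, 1).clone()
for token, idx in new_tok.get_vocab().items():
    text = new_tok.convert_tokens_to_string([token])  # undo BPE markers
    old_ids = old_tok.encode(text, add_special_tokens=False)
    if old_ids:
        new_emb[idx] = old_emb[old_ids].mean(dim=0)

# Swap in the new embedding matrix. With tied weights the LM head follows
# automatically; an untied head would need the same treatment.
model.resize_token_embeddings(new_vocab)
model.get_input_embeddings().weight.data.copy_(new_emb)

# Soft start: freeze the transformer body and train only the embeddings
# first, then unfreeze everything for full continued pretraining.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True
```

The averaging step keeps the new embeddings inside the distribution the model already expects, which is one way to avoid the early degradation mentioned above; how long to keep the body frozen is a tuning choice.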
