Tokenizer expansion

#2 · opened by RASMUS

After first reading your paper, I thought you had used tokenizer expansion. Now that we are pretraining LLMs again and experimenting with continued pretraining, I reread the paper and saw that you did not do tokenizer expansion. Have you tried it since, or done any further experimentation?

SpeakLeash | Spichlerz org

After observing the challenges with tokenizer expansion, we decided to abandon that direction. Instead, we conducted experiments where we replaced the tokenizer with one we trained ourselves. Soon, we will release version v3 of the model, which incorporates our tokenizer. Replacing the tokenizer is a challenging process that requires addressing several issues along the way (e.g., embedding initialization and a soft training start to prevent the model from degrading early on). However, the result is a model that operates with our tokenizer, which is the most desirable option.
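For anyone wanting to experiment along these lines, here is a minimal sketch (PyTorch/Transformers) of one common approach to tokenizer replacement: initialize each new token's embedding as the mean of the old-tokenizer embeddings of its surface string, then use a soft start that trains only the embeddings before full continued pretraining. The checkpoint and tokenizer names are placeholders, and this illustrates the general technique rather than the exact SpeakLeash recipe.

```python
# Sketch: tokenizer replacement with averaged-embedding initialization
# and a "soft start" phase. Names and choices below are illustrative
# assumptions, not the authors' actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "old-base-model"        # hypothetical base checkpoint
new_tokenizer_name = "our-new-tokenizer"  # hypothetical replacement tokenizer

model = AutoModelForCausalLM.from_pretrained(base_model_name)
old_tok = AutoTokenizer.from_pretrained(base_model_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)

old_emb = model.get_input_embeddings().weight.detach()  # (old_vocab, hidden)
new_vocab = len(new_tok)

# Initialize every new token as the mean of the old embeddings of its
# surface string; tokens that map to nothing fall back to the global mean.
new_emb = old_emb.mean(dim=0).repeat(new_vocab, 1).clone()
for token, idx in new_tok.get_vocab().items():
    text = new_tok.convert_tokens_to_string([token])  # undo BPE markers
    old_ids = old_tok.encode(text, add_special_tokens=False)
    if old_ids:
        new_emb[idx] = old_emb[old_ids].mean(dim=0)

# Swap in the new embedding matrix. With tied weights the LM head follows
# automatically; an untied head would need the same treatment.
model.resize_token_embeddings(new_vocab)
model.get_input_embeddings().weight.data.copy_(new_emb)

# Soft start: freeze the transformer body and train only the embeddings
# first, then unfreeze everything for full continued pretraining.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True
```

The averaging step keeps the new embeddings inside the distribution the model already expects, which is one way to avoid the early degradation mentioned above; how long to keep the body frozen is a tuning choice.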
