In this experiment, I trained a tokenizer that supports multiple Indian languages and merged it with the Llama-3 tokenizer.

STEP 1:

I sampled data from the multilingual (7 Indic languages) aloobun/dhpileIN dataset and trained a SentencePiece tokenizer on it, as sketched below.
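
A minimal sketch of this training step, assuming the sampled text was written to a file such as indic_sample.txt and using an illustrative vocabulary size (neither filename nor size is specified above):

```python
# Sketch: train a SentencePiece tokenizer on the sampled Indic text.
# indic_sample.txt and vocab_size are assumptions, not the actual settings.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="indic_sample.txt",      # one sentence per line, UTF-8
    model_prefix="indic_sp",       # writes indic_sp.model and indic_sp.vocab
    vocab_size=32000,              # illustrative; tune to the sampled corpus
    model_type="bpe",              # subword model type (unigram also common)
    character_coverage=0.9995,     # keep near-full coverage of Indic scripts
)
```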

STEP 2:

I evaluated the tokenizer's performance on:

  • Unicode coverage.
  • Token distribution.
  • Tokenization complexity across different scripts.
  • Encoding and decoding capabilities.
  • Edge cases (e.g., special characters, numbers).

STEP 2.1:

The first test gives detailed results on Unicode coverage, token distribution visualization, and tokenization complexity across scripts. A minimal sketch of such a check follows.
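
This is not the actual test script, just a sketch of what the coverage and distribution checks can look like, assuming the trained model was saved as indic_sp.model (a hypothetical filename):

```python
# Sketch: report <unk> rate, token-length distribution, and the scripts
# present in a text. indic_sp.model is an assumed filename.
import unicodedata
from collections import Counter

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="indic_sp.model")

def script_of(ch: str) -> str:
    # First word of the Unicode name, e.g. 'DEVANAGARI', 'BENGALI'
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

def coverage_report(text: str) -> None:
    ids = sp.encode(text)
    pieces = sp.encode(text, out_type=str)
    unk_rate = sum(i == sp.unk_id() for i in ids) / max(len(ids), 1)
    lengths = Counter(len(p.lstrip("▁")) for p in pieces)
    scripts = Counter(script_of(c) for c in text if not c.isspace())
    print(f"tokens: {len(ids)}, <unk> rate: {unk_rate:.2%}")
    print(f"token-length distribution: {dict(sorted(lengths.items()))}")
    print(f"scripts present: {dict(scripts)}")

coverage_report("नमस्ते, मैं भारत से हूँ।")
```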

STEP 2.2:

The second script tests the tokenizer's encoding and decoding (round-trip) capabilities.
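
A minimal sketch of such a round-trip check, again assuming the hypothetical indic_sp.model filename (the actual script may differ):

```python
# Sketch: encode a sentence, decode it back, and verify exact reconstruction.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="indic_sp.model")

def analyze(language: str, text: str) -> None:
    ids = sp.encode(text)                    # token IDs
    pieces = sp.encode(text, out_type=str)   # token strings
    decoded = sp.decode(ids)                 # round-trip back to text
    print(f"{language} Analysis:")
    print(f"Original Text Length: {len(text)} characters")
    print(f"Token IDs Count: {len(ids)}")
    print(f"Token Strings: {pieces}")
    print(f"Text Reconstruction: {decoded == text}")

analyze("Hindi", "नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।")
```

Example output: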

Bengali Analysis:
Original Text Length: 48 characters
Token IDs Count: 11
Token Strings: ['▁আমি', '▁বাংলাদেশ', '▁থেকে', '▁এসে', 'ছি', '।', '▁কলকাতা', '▁একটি', '▁সুন্দর', '▁শহর', '।']
Text Reconstruction: True

Hindi Analysis:
Original Text Length: 49 characters
Token IDs Count: 15
Token Strings: ['▁नम', 'स्ते', ',', '▁मैं', '▁भारत', '▁से', '▁हू', 'ँ', '।', '▁दिल्ली', '▁बहुत', '▁बड़ा', '▁शहर', '▁है', '।']
Text Reconstruction: True

Kannada Analysis:
Original Text Length: 53 characters
Token IDs Count: 13
Token Strings: ['▁ನಾನು', '▁ಬೆಂಗಳೂರಿ', 'ನಿಂದ', '▁ಬಂದ', 'ಿದ್ದೇನೆ', '।', '▁ಕನ್ನಡ', '▁ಒಂದು', '▁ಸೋ', 'ಂಪ', 'ಿನ', '▁ಭಾಷೆ', '।']
Text Reconstruction: True

Malayalam Analysis:
Original Text Length: 47 characters
Token IDs Count: 15
Token Strings: ['▁ഞ', 'ാ', 'ൻ', '▁കേരള', 'ത്തി', 'ൽ', '▁നിന്നാണ്', '.', '▁കൊച്ചി', '▁ഒരു', '▁സുന്ദ', 'ര', '▁നഗ', 'രം', '.']
Text Reconstruction: True

Telugu Analysis:
Original Text Length: 53 characters
Token IDs Count: 10
Token Strings: ['▁నేను', '▁తెలంగాణ', '▁నుంచి', '▁వచ్చ', 'ాను', '.', '▁హైదరాబాద్', '▁అద్భుతమైన', '▁నగరం', '.']
Text Reconstruction: True

Tamil Analysis:
Original Text Length: 54 characters
Token IDs Count: 13
Token Strings: ['▁நான்', '▁தமிழ்நா', 'ட்டை', 'ச்', '▁சேர்ந்த', 'வன்', '.', '▁சென்னை', '▁ஒரு', '▁பெரிய', '▁நக', 'ரம்', '.']
Text Reconstruction: True

Gujarati Analysis:
Original Text Length: 50 characters
Token IDs Count: 12
Token Strings: ['▁હું', '▁ગુજરાત', '▁થી', '▁આવ્યો', '▁છું', '।', '▁અમદાવાદ', '▁એક', '▁સુંદર', '▁શહેર', '▁છે', '।']
Text Reconstruction: True

STEP 3:

This script merges the trained SentencePiece tokenizer into the Llama-3 tokenizer, extending its vocabulary; a minimal merge sketch follows the checklist below.

The script ensures that:

  • No duplicate tokens are added.
  • Tokens aren't excessively long.
  • New tokens are correctly integrated.
  • Token mappings remain consistent.
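
A minimal sketch of the merge under two assumptions: that the gated meta-llama/Meta-Llama-3-8B tokenizer is accessible, and that the trained model is indic_sp.model. Note this is not necessarily the author's exact method; in particular, SentencePiece's "▁" word-boundary marker differs from Llama-3's byte-level BPE conventions, so a production merge may need to remap prefixes rather than rely on add_tokens:

```python
# Sketch: extend the Llama-3 tokenizer with pieces from the trained
# SentencePiece model, skipping duplicates and overly long tokens.
import sentencepiece as spm
from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
sp = spm.SentencePieceProcessor(model_file="indic_sp.model")

MAX_TOKEN_LEN = 16                      # assumed cap on token length
existing = set(llama_tok.get_vocab())   # current vocab, to avoid duplicates

new_tokens = []
for i in range(sp.get_piece_size()):
    piece = sp.id_to_piece(i)
    if sp.is_control(i) or sp.is_unknown(i):
        continue                        # skip <s>, </s>, <unk>, etc.
    if piece in existing or len(piece) > MAX_TOKEN_LEN:
        continue                        # no duplicates, no overlong tokens
    # NOTE: Llama-3 uses byte-level BPE ('Ġ' space marker); SentencePiece
    # keeps '▁'. A real merge would reconcile the two conventions here.
    new_tokens.append(piece)

added = llama_tok.add_tokens(new_tokens)
print(f"Added {added} tokens; new vocab size: {len(llama_tok)}")
llama_tok.save_pretrained("IN-Llama-3-Tokenizer")
```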

I feel there is some unnecessary bloat in the script, such as token validation and redundant test methods. I'm still working on improvements and will update as soon as I make progress.

Here's a comparison of subword fertility scores (average number of tokens per word; lower is better) between sarvam-1 and this tokenizer. A sketch of how fertility can be computed follows the table.

| Language  | sarvam-1 | IN-Llama-3-Tokenizer |
|-----------|----------|----------------------|
| Bengali   | 1.7      | 3.52                 |
| Gujarati  | 2.784313 | 3.588235             |
| Hindi     | 1.583333 | 2.933333             |
| Kannada   | 2.571428 | 3.976190             |
| Malayalam | 3.487804 | 4.365853             |
| Tamil     | 2.767441 | 3.860465             |
| Telugu    | 2.372093 | 3.511627             |
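
A minimal sketch of how subword fertility can be computed, assuming the merged tokenizer was saved locally as IN-Llama-3-Tokenizer and treating whitespace-separated strings as words (the single-sentence corpus here is only a stand-in for the real evaluation set):

```python
# Sketch: subword fertility = tokens emitted per whitespace-delimited word.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("IN-Llama-3-Tokenizer")

def fertility(sentences: list[str]) -> float:
    words = sum(len(s.split()) for s in sentences)
    tokens = sum(len(tok.encode(s, add_special_tokens=False)) for s in sentences)
    return tokens / max(words, 1)

# Illustrative corpus; the reported scores come from a larger held-out set.
print(f"Hindi fertility: {fertility(['नमस्ते, मैं भारत से हूँ।']):.6f}")
```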