ibrim committed on
Commit b347aa0
1 Parent(s): 0bc5178

Upload 20 files
LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Andrej

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md CHANGED
@@ -1,12 +1,150 @@
- ---
- title: Tokenizer Encode Decode
- emoji: 📉
- colorFrom: green
- colorTo: purple
- sdk: gradio
- sdk_version: 4.37.2
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# minbpe

Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.

This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https://github.com/openai/gpt-2) from OpenAI. [Sennrich et al. 2015](https://arxiv.org/abs/1508.07909) is cited as the original reference for the use of BPE in NLP applications. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.

There are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The files of the repo are as follows:

1. [minbpe/base.py](minbpe/base.py): Implements the `Tokenizer` class, which is the base class. It contains the `train`, `encode`, and `decode` stubs, save/load functionality, and a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.
2. [minbpe/basic.py](minbpe/basic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.
3. [minbpe/regex.py](minbpe/regex.py): Implements the `RegexTokenizer`, which further splits the input text by a regex pattern. This is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization, which ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4. This class also handles special tokens, if any.
4. [minbpe/gpt4.py](minbpe/gpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (3, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and the handling of some unfortunate (and likely historical?) 1-byte token permutations.

Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [tests/taylorswift.txt](tests/taylorswift.txt) (this is the Wikipedia entry for Taylor Swift) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.

All of the files above are very short and thoroughly commented, and also contain a usage example at the bottom of the file.

## quick start

As the simplest example, we can reproduce the [Wikipedia article on BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) as follows:

```python
from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
print(tokenizer.encode(text))
# [258, 100, 258, 97, 99]
print(tokenizer.decode([258, 100, 258, 97, 99]))
# aaabdaaabac
tokenizer.save("toy")
# writes two files: toy.model (for loading) and toy.vocab (for viewing)
```

According to Wikipedia, running BPE on the input string "aaabdaaabac" for 3 merges results in the string "XdXac", where X=ZY, Y=ab, and Z=aa. The tricky thing to note is that minbpe always allocates the 256 individual bytes as tokens, and then merges bytes as needed from there. So for us a=97, b=98, c=99, d=100 (their [ASCII](https://www.asciitable.com) values). Then when (a,a) is merged to Z, Z will become 256. Likewise Y will become 257 and X 258. So we start with the 256 bytes, and do 3 merges to get to the result above, with the expected output of [258, 100, 258, 97, 99].
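
You can inspect exactly what those 3 merges learned by reading back the `merges` and `vocab` dictionaries of the trained tokenizer. A minimal sketch (it assumes the toy `tokenizer` trained in the snippet above is still in scope):

```python
# print each learned merge: (left id, right id) -> new id, and the bytes the new id stands for
for pair, idx in tokenizer.merges.items():
    print(pair, "->", idx, tokenizer.vocab[idx])
```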

## inference: GPT-4 comparison

We can verify that the `RegexTokenizer` has feature parity with the GPT-4 tokenizer from [tiktoken](https://github.com/openai/tiktoken) as follows:

```python
text = "hello123!!!? (안녕하세요!) 😉"

# tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]

# ours
from minbpe import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
print(tokenizer.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
```

(You'll have to `pip install tiktoken` to run this.) Under the hood, the `GPT4Tokenizer` is just a light wrapper around `RegexTokenizer`, passing in the merges and the special tokens of GPT-4. We can also ensure the special tokens are handled correctly:

```python
text = "<|endoftext|>hello world"

# tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode(text, allowed_special="all"))
# [100257, 15339, 1917]

# ours
from minbpe import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
print(tokenizer.encode(text, allowed_special="all"))
# [100257, 15339, 1917]
```

Note that just like tiktoken, we have to explicitly declare our intent to use and parse special tokens in the call to `encode`. Otherwise this can become a major footgun, unintentionally tokenizing attacker-controlled data (e.g. user prompts) with special tokens. The `allowed_special` parameter can be set to "all", "none", or a custom set of special tokens to allow.
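
By default, special tokens appearing in ordinary text are rejected rather than silently encoded. A small sketch of that behavior (it assumes tiktoken is installed so that `GPT4Tokenizer` can load its merges):

```python
from minbpe import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
# the default is allowed_special="none_raise": an AssertionError is raised
# if the text contains any registered special token
tokenizer.encode("<|endoftext|>hello world")
```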

## training

Unlike tiktoken, this code allows you to train your own tokenizer. In principle and to my knowledge, if you train the `RegexTokenizer` on a large dataset with a vocabulary size of 100K, you would reproduce the GPT-4 tokenizer.

There are two paths you can follow. First, you can decide that you don't want the complexity of splitting and preprocessing text with regex patterns, and you also don't care for special tokens. In that case, reach for the `BasicTokenizer`. You can train it, and then encode and decode, for example, as follows:

```python
from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
tokenizer.train(very_long_training_string, vocab_size=4096)
tokenizer.encode("hello world") # string -> tokens
tokenizer.decode([1000, 2000, 3000]) # tokens -> string
tokenizer.save("mymodel") # writes mymodel.model and mymodel.vocab
tokenizer.load("mymodel.model") # loads the model back; the vocab is just for visualization
```

If you instead want to follow along with what OpenAI did for their text tokenizer, it's a good idea to adopt their approach of using a regex pattern to split the text by categories. The GPT-4 pattern is the default for the `RegexTokenizer`, so you'd simply do something like:

```python
from minbpe import RegexTokenizer
tokenizer = RegexTokenizer()
tokenizer.train(very_long_training_string, vocab_size=32768)
tokenizer.encode("hello world") # string -> tokens
tokenizer.decode([1000, 2000, 3000]) # tokens -> string
tokenizer.save("tok32k") # writes tok32k.model and tok32k.vocab
tokenizer.load("tok32k.model") # loads the model back from disk
```

Where, of course, you'd want to change the vocabulary size depending on the size of your dataset.

**Special tokens**. Finally, you might wish to add special tokens to your tokenizer. Register these using the `register_special_tokens` function. For example, if you train with a vocab_size of 32768, then the first 256 tokens are raw byte tokens, the next 32768-256 are merge tokens, and after those you can add the special tokens. The last "real" merge token will have an id of 32767 (vocab_size - 1), so your first special token should come right after that, with an id of exactly 32768. So:

```python
from minbpe import RegexTokenizer
tokenizer = RegexTokenizer()
tokenizer.train(very_long_training_string, vocab_size=32768)
tokenizer.register_special_tokens({"<|endoftext|>": 32768})
tokenizer.encode("<|endoftext|>hello world", allowed_special="all")
```

You can of course add more tokens after that as well, as you like. Finally, I'd like to stress that I tried hard to keep the code itself clean, readable and hackable. You should not feel scared to read the code and understand how it works. The tests are also a nice place to look for more usage examples. That reminds me:

## tests

We use the pytest library for tests. All of them are located in the `tests/` directory. First `pip install pytest` if you haven't already, then:

```bash
$ pytest -v .
```

to run the tests. (-v is verbose, slightly prettier.)

## community extensions

* [gnp/minbpe-rs](https://github.com/gnp/minbpe-rs): A Rust implementation of `minbpe` providing (near) one-to-one correspondence with the Python version

## exercise

For those trying to study BPE, here is the advised progression exercise for how you can build your own minbpe step by step. See [exercise.md](exercise.md).

## lecture

I built the code in this repository in this [YouTube video](https://www.youtube.com/watch?v=zduSFxRajkE). You can also find this lecture in text form in [lecture.md](lecture.md).

## todos

- write a more optimized Python version that could run over large files and big vocabs
- write an even more optimized C or Rust version (think through)
- rename GPT4Tokenizer to GPTTokenizer and support GPT-2/GPT-3/GPT-3.5 as well?
- write a LlamaTokenizer similar to GPT4Tokenizer (i.e. attempt sentencepiece equivalent)

## License

MIT
app.py ADDED
@@ -0,0 +1,65 @@
# from minbpe import BasicTokenizer, RegexTokenizer
# tokenizer = RegexTokenizer()
# tokenizer.load("first.model")
#
# text_to_encode = "मुझसे क्या होगा अब"
# encoded_text = tokenizer.encode(text_to_encode)
# print("Encoded:", encoded_text)
#
# # Print the tokenized text
# print("Tokenized Text:", encoded_text)
#
# # Decode text
# decoded_text = tokenizer.decode(encoded_text)
# print("Decoded:", decoded_text)

import gradio as gr
from minbpe import BasicTokenizer, RegexTokenizer

# Initialize the tokenizer
tokenizer = RegexTokenizer()
tokenizer.load("first.model")

# Define the encoding function
def encode_text(text):
    encoded_text = tokenizer.encode(text)
    return str(encoded_text)

# Define the decoding function
def decode_text(encoded_text):
    encoded_list = list(map(int, encoded_text.strip('[]').split(',')))
    decoded_text = tokenizer.decode(encoded_list)
    return decoded_text

# Define the Gradio interface
def gradio_app():
    with gr.Blocks() as demo:
        gr.Markdown("# Text Encoder and Decoder")

        with gr.Row():
            with gr.Column():
                text_input = gr.Textbox(label="Text to Encode")
                encoded_output = gr.Textbox(label="Encoded Text", interactive=False)
                encode_button = gr.Button("Encode")

                def encode_handler(text):
                    return encode_text(text)

                encode_button.click(fn=encode_handler, inputs=text_input, outputs=encoded_output)

            with gr.Column():
                encoded_input = gr.Textbox(label="Encoded Text")
                decoded_output = gr.Textbox(label="Decoded Text", interactive=False)
                decode_button = gr.Button("Decode")

                def decode_handler(encoded_text):
                    return decode_text(encoded_text)

                decode_button.click(fn=decode_handler, inputs=encoded_input, outputs=decoded_output)

    return demo

# Launch the app
if __name__ == "__main__":
    app = gradio_app()
    app.launch()
assets/tiktokenizer.png ADDED
exercise.md ADDED
@@ -0,0 +1,55 @@
# exercise

Build your own GPT-4 Tokenizer!

### Step 1

Write the `BasicTokenizer` class, with the following three core functions:

- `def train(self, text, vocab_size, verbose=False)`
- `def encode(self, text)`
- `def decode(self, ids)`

Train your tokenizer on whatever text you like and visualize the merged tokens. Do they look reasonable? One default test you may wish to use is the text file `tests/taylorswift.txt`.

### Step 2

Convert your `BasicTokenizer` into a `RegexTokenizer`, which takes a regex pattern and splits the text exactly as GPT-4 would. Process the parts separately as before, then concatenate the results. Retrain your tokenizer and compare the results before and after. You should see that you will now have no tokens that go across categories (numbers, letters, punctuation, more than one whitespace). Use the GPT-4 pattern:

```
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
```
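
As a sanity check on the splitting stage, here is a minimal sketch (it uses the third-party `regex` module rather than the stdlib `re`, because the pattern relies on `\p{...}` character classes):

```python
import regex as re

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
chunks = re.findall(GPT4_SPLIT_PATTERN, "Hello world123 how's it going!!!?")
print(chunks)  # each chunk is BPE-encoded separately, so no merge ever crosses a chunk boundary
```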

### Step 3

You're now ready to load the merges from the GPT-4 tokenizer and show that your tokenizer produces the identical results for both `encode` and `decode`, matching [tiktoken](https://github.com/openai/tiktoken).

```
# match this
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # this is the GPT-4 tokenizer
ids = enc.encode("hello world!!!? (안녕하세요!) lol123 😉")
text = enc.decode(ids) # get the same text back
```

Unfortunately, you will run into two issues:

1. It is not trivial to recover the raw merges from the GPT-4 tokenizer. You can easily recover what we call `vocab` here, and what they call and store under `enc._mergeable_ranks`. Feel free to copy paste the `recover_merges` function in `minbpe/gpt4.py`, which takes these ranks and returns the raw merges. If you wish to know how this function works, read [this](https://github.com/openai/tiktoken/issues/60) and [this](https://github.com/karpathy/minbpe/issues/11#issuecomment-1950805306). Basically, under some conditions it is enough to only store the parent nodes (and their rank) and get rid of the precise details of which children merged up to any parent.
2. Second, the GPT-4 tokenizer for some reason permutes its raw bytes. It stores this permutation in the first 256 elements of the mergeable ranks, so you can recover this byte shuffle relatively simply as `byte_shuffle = {i: enc._mergeable_ranks[bytes([i])] for i in range(256)}`. In both your encode and decode, you'll have to shuffle bytes around accordingly (see the sketch after this list). If you're stuck, reference the `minbpe/gpt4.py` file for hints.
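
Here is a small sketch of that byte shuffle and of undoing it (it assumes the `enc` object from the snippet above):

```python
byte_shuffle = {i: enc._mergeable_ranks[bytes([i])] for i in range(256)}
inverse_byte_shuffle = {v: k for k, v in byte_shuffle.items()}

raw = "hello".encode("utf-8")
shuffled = bytes(byte_shuffle[b] for b in raw)               # apply before running BPE merges in encode()
restored = bytes(inverse_byte_shuffle[b] for b in shuffled)  # undo after concatenating bytes in decode()
assert restored == raw
```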

### Step 4

(Optional, irritating, not obviously useful) Add the ability to handle special tokens. You'll then be able to match the output of tiktoken even when special tokens are present, e.g.:

```
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # this is the GPT-4 tokenizer
ids = enc.encode("<|endoftext|>hello world", allowed_special="all")
```

Without `allowed_special` tiktoken will error.

### Step 5

If you've made it this far, you're now a pro at LLM Tokenization! Sadly, you're not exactly done yet because a lot of LLMs outside of OpenAI (e.g. Llama, Mistral) use [sentencepiece](https://github.com/google/sentencepiece) instead. The primary difference is that sentencepiece runs BPE directly on Unicode code points instead of on UTF-8 encoded bytes. Feel free to explore sentencepiece on your own (good luck, it's not too pretty), and, as a stretch goal if you can bear the burden of time, re-write your BPE to run on Unicode code points and match the Llama 2 tokenizer.
first.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e2c490db19a6238afee8a6869e07688db901dcf5f0a30b4a06329c77124f93b
size 161
first.vocab ADDED
@@ -0,0 +1,261 @@
1
+ [\u0000] 0
2
+ [\u0001] 1
3
+ [\u0002] 2
4
+ [\u0003] 3
5
+ [\u0004] 4
6
+ [\u0005] 5
7
+ [\u0006] 6
8
+ [\u0007] 7
9
+ [\u0008] 8
10
+ [\u0009] 9
11
+ [\u000a] 10
12
+ [\u000b] 11
13
+ [\u000c] 12
14
+ [\u000d] 13
15
+ [\u000e] 14
16
+ [\u000f] 15
17
+ [\u0010] 16
18
+ [\u0011] 17
19
+ [\u0012] 18
20
+ [\u0013] 19
21
+ [\u0014] 20
22
+ [\u0015] 21
23
+ [\u0016] 22
24
+ [\u0017] 23
25
+ [\u0018] 24
26
+ [\u0019] 25
27
+ [\u001a] 26
28
+ [\u001b] 27
29
+ [\u001c] 28
30
+ [\u001d] 29
31
+ [\u001e] 30
32
+ [\u001f] 31
33
+ [ ] 32
34
+ [!] 33
35
+ ["] 34
36
+ [#] 35
37
+ [$] 36
38
+ [%] 37
39
+ [&] 38
40
+ ['] 39
41
+ [(] 40
42
+ [)] 41
43
+ [*] 42
44
+ [+] 43
45
+ [,] 44
46
+ [-] 45
47
+ [.] 46
48
+ [/] 47
49
+ [0] 48
50
+ [1] 49
51
+ [2] 50
52
+ [3] 51
53
+ [4] 52
54
+ [5] 53
55
+ [6] 54
56
+ [7] 55
57
+ [8] 56
58
+ [9] 57
59
+ [:] 58
60
+ [;] 59
61
+ [<] 60
62
+ [=] 61
63
+ [>] 62
64
+ [?] 63
65
+ [@] 64
66
+ [A] 65
67
+ [B] 66
68
+ [C] 67
69
+ [D] 68
70
+ [E] 69
71
+ [F] 70
72
+ [G] 71
73
+ [H] 72
74
+ [I] 73
75
+ [J] 74
76
+ [K] 75
77
+ [L] 76
78
+ [M] 77
79
+ [N] 78
80
+ [O] 79
81
+ [P] 80
82
+ [Q] 81
83
+ [R] 82
84
+ [S] 83
85
+ [T] 84
86
+ [U] 85
87
+ [V] 86
88
+ [W] 87
89
+ [X] 88
90
+ [Y] 89
91
+ [Z] 90
92
+ [[] 91
93
+ [\] 92
94
+ []] 93
95
+ [^] 94
96
+ [_] 95
97
+ [`] 96
98
+ [a] 97
99
+ [b] 98
100
+ [c] 99
101
+ [d] 100
102
+ [e] 101
103
+ [f] 102
104
+ [g] 103
105
+ [h] 104
106
+ [i] 105
107
+ [j] 106
108
+ [k] 107
109
+ [l] 108
110
+ [m] 109
111
+ [n] 110
112
+ [o] 111
113
+ [p] 112
114
+ [q] 113
115
+ [r] 114
116
+ [s] 115
117
+ [t] 116
118
+ [u] 117
119
+ [v] 118
120
+ [w] 119
121
+ [x] 120
122
+ [y] 121
123
+ [z] 122
124
+ [{] 123
125
+ [|] 124
126
+ [}] 125
127
+ [~] 126
128
+ [\u007f] 127
129
+ [�] 128
130
+ [�] 129
131
+ [�] 130
132
+ [�] 131
133
+ [�] 132
134
+ [�] 133
135
+ [�] 134
136
+ [�] 135
137
+ [�] 136
138
+ [�] 137
139
+ [�] 138
140
+ [�] 139
141
+ [�] 140
142
+ [�] 141
143
+ [�] 142
144
+ [�] 143
145
+ [�] 144
146
+ [�] 145
147
+ [�] 146
148
+ [�] 147
149
+ [�] 148
150
+ [�] 149
151
+ [�] 150
152
+ [�] 151
153
+ [�] 152
154
+ [�] 153
155
+ [�] 154
156
+ [�] 155
157
+ [�] 156
158
+ [�] 157
159
+ [�] 158
160
+ [�] 159
161
+ [�] 160
162
+ [�] 161
163
+ [�] 162
164
+ [�] 163
165
+ [�] 164
166
+ [�] 165
167
+ [�] 166
168
+ [�] 167
169
+ [�] 168
170
+ [�] 169
171
+ [�] 170
172
+ [�] 171
173
+ [�] 172
174
+ [�] 173
175
+ [�] 174
176
+ [�] 175
177
+ [�] 176
178
+ [�] 177
179
+ [�] 178
180
+ [�] 179
181
+ [�] 180
182
+ [�] 181
183
+ [�] 182
184
+ [�] 183
185
+ [�] 184
186
+ [�] 185
187
+ [�] 186
188
+ [�] 187
189
+ [�] 188
190
+ [�] 189
191
+ [�] 190
192
+ [�] 191
193
+ [�] 192
194
+ [�] 193
195
+ [�] 194
196
+ [�] 195
197
+ [�] 196
198
+ [�] 197
199
+ [�] 198
200
+ [�] 199
201
+ [�] 200
202
+ [�] 201
203
+ [�] 202
204
+ [�] 203
205
+ [�] 204
206
+ [�] 205
207
+ [�] 206
208
+ [�] 207
209
+ [�] 208
210
+ [�] 209
211
+ [�] 210
212
+ [�] 211
213
+ [�] 212
214
+ [�] 213
215
+ [�] 214
216
+ [�] 215
217
+ [�] 216
218
+ [�] 217
219
+ [�] 218
220
+ [�] 219
221
+ [�] 220
222
+ [�] 221
223
+ [�] 222
224
+ [�] 223
225
+ [�] 224
226
+ [�] 225
227
+ [�] 226
228
+ [�] 227
229
+ [�] 228
230
+ [�] 229
231
+ [�] 230
232
+ [�] 231
233
+ [�] 232
234
+ [�] 233
235
+ [�] 234
236
+ [�] 235
237
+ [�] 236
238
+ [�] 237
239
+ [�] 238
240
+ [�] 239
241
+ [�] 240
242
+ [�] 241
243
+ [�] 242
244
+ [�] 243
245
+ [�] 244
246
+ [�] 245
247
+ [�] 246
248
+ [�] 247
249
+ [�] 248
250
+ [�] 249
251
+ [�] 250
252
+ [�] 251
253
+ [�] 252
254
+ [�] 253
255
+ [�] 254
256
+ [�] 255
257
+ [�][�] -> [�] 256
258
+ [ ][�] -> [ �] 257
259
+ [�][�] -> [�] 258
260
+ [�][�] -> [ा] 259
261
+ [�][�] -> [र] 260
lecture.md ADDED
@@ -0,0 +1,107 @@
# LLM Tokenization

Hi everyone, today we are going to look at Tokenization in Large Language Models (LLMs). Sadly, tokenization is a relatively complex and gnarly component of the state of the art LLMs, but it is necessary to understand in some detail because a lot of the shortcomings of LLMs that may be attributed to the neural network or otherwise appear mysterious actually trace back to tokenization.

### Previously: character-level tokenization

So what is tokenization? Well it turns out that in our previous video, [Let's build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY), we already covered tokenization but it was only a very simple, naive, character-level version of it. When you go to the [Google colab](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing) for that video, you'll see that we started with our training data ([Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)), which is just a large string in Python:

```
First Citizen: Before we proceed any further, hear me speak.

All: Speak, speak.

First Citizen: You are all resolved rather to die than to famish?

All: Resolved. resolved.

First Citizen: First, you know Caius Marcius is chief enemy to the people.

All: We know't, we know't.
```

But how do we feed strings into a language model? Well, we saw that we did this by first constructing a vocabulary of all the possible characters we found in the entire training set:

```python
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

# !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# 65
```

And then creating a lookup table for converting between individual characters and integers according to the vocabulary above. This lookup table was just a Python dictionary:

```python
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("hii there"))
print(decode(encode("hii there")))

# [46, 47, 47, 1, 58, 46, 43, 56, 43]
# hii there
```

Once we've converted a string into a sequence of integers, we saw that each integer was used as an index into a 2-dimensional embedding of trainable parameters. Because we have a vocabulary size of `vocab_size=65`, this embedding table will also have 65 rows:

```python
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

    def forward(self, idx, targets=None):
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
```

Here, the integer "plucks out" a row of this embedding table and this row is the vector that represents this token. This vector then feeds into the Transformer as the input at the corresponding time step.

### "Character chunks" for tokenization using the BPE algorithm

This is all well and good for the naive setting of a character-level language model. But in practice, in state of the art language models, people use a lot more complicated schemes for constructing these token vocabularies. In particular, these schemes work not on a character level, but on a character-chunk level. And the way these chunk vocabularies are constructed is by using algorithms such as the **Byte Pair Encoding** (BPE) algorithm, which we are going to cover in detail below.

Turning to the historical development of this approach for a moment, the paper that popularized the use of the byte-level BPE algorithm for language model tokenization is the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) from OpenAI in 2019, "Language Models are Unsupervised Multitask Learners". Scroll down to Section 2.2 on "Input Representation" where they describe and motivate this algorithm. At the end of this section you'll see them say:

> *The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batchsize of 512 is used.*

Recall that in the attention layer of a Transformer, every token is attending to a finite list of tokens previously in the sequence. The paper here says that the GPT-2 model has a context length of 1024 tokens, up from 512 in GPT-1. In other words, tokens are the fundamental "atoms" at the input to the LLM. And tokenization is the process for taking raw strings in Python and converting them to a list of tokens, and vice versa. As another popular example to demonstrate the pervasiveness of this abstraction, if you go to the [Llama 2](https://arxiv.org/abs/2307.09288) paper as well and you search for "token", you're going to get 63 hits. So for example, the paper claims that they trained on 2 trillion tokens, etc.

### Brief taste of the complexities of tokenization

Before we dive into details of the implementation, let's briefly motivate the need to understand the tokenization process in some detail. Tokenization is at the heart of a lot of weirdness in LLMs and I would advise that you do not brush it off. A lot of the issues that may look like issues with the neural network architecture actually trace back to tokenization. Here are just a few examples:

- Why can't LLMs spell words? **Tokenization**.
- Why can't LLMs do super simple string processing tasks like reversing a string? **Tokenization**.
- Why are LLMs worse at non-English languages (e.g. Japanese)? **Tokenization**.
- Why are LLMs bad at simple arithmetic? **Tokenization**.
- Why did GPT-2 have more than necessary trouble coding in Python? **Tokenization**.
- Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? **Tokenization**.
- What is this weird warning I get about a "trailing whitespace"? **Tokenization**.
- Why did the LLM break if I ask it about "SolidGoldMagikarp"? **Tokenization**.
- Why should I prefer to use YAML over JSON with LLMs? **Tokenization**.
- Why is the LLM not actually doing end-to-end language modeling? **Tokenization**.
- What is the real root of suffering? **Tokenization**.

We will loop back around to these at the end of the video.

### Visual preview of tokenization

Next, let's load this [tokenization webapp](https://tiktokenizer.vercel.app). What is nice about this webapp is that tokenization is running live in your web browser, allowing you to easily input some text string at the input, and see the tokenization on the right. On the top, you can see that we are currently using the `gpt2` tokenizer, and we see that the string that we pasted in with this example is currently tokenizing into 300 tokens. Here they are shown explicitly in colors:

![tiktokenizer](assets/tiktokenizer.png)

So for example, the string "Tokenization" encodes into the token 30642 followed by the token 1634. The token " is" (note that this is three characters, including the space in the front, this is important!) is index 318. Be careful with whitespace because it is absolutely present in the string and must be tokenized along with all the other characters, but is usually omitted in visualization for clarity. You can toggle its visualization on and off at the bottom of the app. In the same way, the token " at" is 379, " the" is 262, etc.

Next, we have a simple example of some arithmetic. Here, we see that numbers may be inconsistently decomposed by the tokenizer. For example, the number 127 is a single token of three characters, but the number 677 becomes two tokens: the token " 6" (again, note the space in the front!) and the token "77". We rely on the large language model to make sense of this arbitrariness. It has to learn inside its parameters and during training that these two tokens (" 6" and "77") actually combine to create the number 677. In the same way, we see that if the LLM wanted to predict that the result of this sum is the number 804, it would have to output that in two time steps: first it has to emit the token " 8", and then the token "04". Note that all of these splits look completely arbitrary. In the example right below, we see that 1275 is "12" followed by "75", 6773 is actually two tokens " 6", "773", and 8041 is " 8", "041".
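
You can poke at this yourself programmatically as well. A small sketch (it assumes `pip install tiktoken`; the exact splits come from the `gpt2` vocabulary, as in the webapp above):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for s in [" 127", " 677", " 804", " 1275", " 6773", " 8041"]:
    ids = enc.encode(s)
    # print each number next to its token ids and the string piece each id maps back to
    print(repr(s), ids, [enc.decode([i]) for i in ids])
```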

(to be continued...)
(TODO: may continue this unless we figure out how to generate it automatically from the video :))
minbep.py ADDED
@@ -0,0 +1,23 @@
from minbpe import RegexTokenizer

# Initialize the tokenizer
tokenizer = RegexTokenizer()

# Read text from a file
file_path = "/Users/mohammad.ibrahim/Desktop/TSAI/combined_text.txt"
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Train the tokenizer
tokenizer.train(text, 256 + 5)  # 256 are the byte tokens, then do 5 merges

# Encode the text
encoded_text = tokenizer.encode(text)
print("Encoded:", encoded_text)

# Decode the text
decoded_text = tokenizer.decode(encoded_text)
print("Decoded:", decoded_text)

# Save the trained tokenizer model
tokenizer.save("first")  # writes two files: first.model (for loading) and first.vocab (for viewing)
minbpe/__init__.py ADDED
@@ -0,0 +1,4 @@
from .base import Tokenizer
from .basic import BasicTokenizer
from .regex import RegexTokenizer
from .gpt4 import GPT4Tokenizer
minbpe/base.py ADDED
@@ -0,0 +1,165 @@
"""
Contains the base Tokenizer class and a few common helper functions.
The base class also contains the (common) save/load functionality.
It would be possible to be a lot more strict about the interface and
e.g. isolating all regex/pattern parts to the RegexTokenizer, but
some concessions are made for simplicity.
"""
import unicodedata

# -----------------------------------------------------------------------------
# a few helper functions useful for both BasicTokenizer and RegexTokenizer

def get_stats(ids, counts=None):
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    Optionally allows to update an existing dictionary of counts
    """
    counts = {} if counts is None else counts
    for pair in zip(ids, ids[1:]): # iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    """
    In the list of integers (ids), replace all consecutive occurrences
    of pair with the new integer token idx
    Example: ids=[1, 2, 3, 1, 2], pair=(1, 2), idx=4 -> [4, 3, 4]
    """
    newids = []
    i = 0
    while i < len(ids):
        # if not at the very last position AND the pair matches, replace it
        if ids[i] == pair[0] and i < len(ids) - 1 and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

# first two helper functions...
def replace_control_characters(s: str) -> str:
    # we don't want to print control characters
    # which distort the output (e.g. \n or much worse)
    # https://stackoverflow.com/questions/4324790/removing-control-characters-from-a-string-in-python/19016117#19016117
    # http://www.unicode.org/reports/tr44/#GC_Values_Table
    chars = []
    for ch in s:
        if unicodedata.category(ch)[0] != "C":
            chars.append(ch) # this character is ok
        else:
            chars.append(f"\\u{ord(ch):04x}") # escape
    return "".join(chars)

def render_token(t: bytes) -> str:
    # pretty print a token, escaping control characters
    s = t.decode('utf-8', errors='replace')
    s = replace_control_characters(s)
    return s

# -----------------------------------------------------------------------------
# the base Tokenizer class

class Tokenizer:
    """Base class for Tokenizers"""

    def __init__(self):
        # default: vocab size of 256 (all bytes), no merges, no patterns
        self.merges = {} # (int, int) -> int
        self.pattern = "" # str
        self.special_tokens = {} # str -> int, e.g. {'<|endoftext|>': 100257}
        self.vocab = self._build_vocab() # int -> bytes

    def train(self, text, vocab_size, verbose=False):
        # Tokenizer can train a vocabulary of size vocab_size from text
        raise NotImplementedError

    def encode(self, text):
        # Tokenizer can encode a string into a list of integers
        raise NotImplementedError

    def decode(self, ids):
        # Tokenizer can decode a list of integers into a string
        raise NotImplementedError

    def _build_vocab(self):
        # vocab is simply and deterministically derived from merges
        vocab = {idx: bytes([idx]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        for special, idx in self.special_tokens.items():
            vocab[idx] = special.encode("utf-8")
        return vocab

    def save(self, file_prefix):
        """
        Saves two files: file_prefix.vocab and file_prefix.model
        This is inspired (but not equivalent to!) sentencepiece's model saving:
        - model file is the critical one, intended for load()
        - vocab file is just a pretty printed version for human inspection only
        """
        # write the model: to be used in load() later
        model_file = file_prefix + ".model"
        with open(model_file, 'w') as f:
            # write the version, pattern and merges, that's all that's needed
            f.write("minbpe v1\n")
            f.write(f"{self.pattern}\n")
            # write the special tokens, first the number of them, then each one
            f.write(f"{len(self.special_tokens)}\n")
            for special, idx in self.special_tokens.items():
                f.write(f"{special} {idx}\n")
            # the merges dict
            for idx1, idx2 in self.merges:
                f.write(f"{idx1} {idx2}\n")
        # write the vocab: for the human to look at
        vocab_file = file_prefix + ".vocab"
        inverted_merges = {idx: pair for pair, idx in self.merges.items()}
        with open(vocab_file, "w", encoding="utf-8") as f:
            for idx, token in self.vocab.items():
                # note: many tokens may be partial utf-8 sequences
                # and cannot be decoded into valid strings. Here we're using
                # errors='replace' to replace them with the replacement char �.
                # this also means that we couldn't possibly use .vocab in load()
                # because decoding in this way is a lossy operation!
                s = render_token(token)
                # find the children of this token, if any
                if idx in inverted_merges:
                    # if this token has children, render it nicely as a merge
                    idx0, idx1 = inverted_merges[idx]
                    s0 = render_token(self.vocab[idx0])
                    s1 = render_token(self.vocab[idx1])
                    f.write(f"[{s0}][{s1}] -> [{s}] {idx}\n")
                else:
                    # otherwise this is leaf token, just print it
                    # (this should just be the first 256 tokens, the bytes)
                    f.write(f"[{s}] {idx}\n")

    def load(self, model_file):
        """Inverse of save() but only for the model file"""
        assert model_file.endswith(".model")
        # read the model file
        merges = {}
        special_tokens = {}
        idx = 256
        with open(model_file, 'r', encoding="utf-8") as f:
            # read the version
            version = f.readline().strip()
            assert version == "minbpe v1"
            # read the pattern
            self.pattern = f.readline().strip()
            # read the special tokens
            num_special = int(f.readline().strip())
            for _ in range(num_special):
                special, special_idx = f.readline().strip().split()
                special_tokens[special] = int(special_idx)
            # read the merges
            for line in f:
                idx1, idx2 = map(int, line.split())
                merges[(idx1, idx2)] = idx
                idx += 1
        self.merges = merges
        self.special_tokens = special_tokens
        self.vocab = self._build_vocab()
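
For reference, this is how the two helpers at the top of the file compose during one merge step (a tiny sketch using the example values from their docstrings):

```python
from minbpe.base import get_stats, merge

ids = [1, 2, 3, 1, 2]
stats = get_stats(ids)            # {(1, 2): 2, (2, 3): 1, (3, 1): 1}
pair = max(stats, key=stats.get)  # (1, 2) is the most frequent pair
print(merge(ids, pair, 4))        # [4, 3, 4]
```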
minbpe/basic.py ADDED
@@ -0,0 +1,74 @@
"""
Minimal (byte-level) Byte Pair Encoding tokenizer.

Algorithmically follows along the GPT tokenizer:
https://github.com/openai/gpt-2/blob/master/src/encoder.py

But:
- Does not handle the regular expression splitting pattern.
- Does not handle any special tokens.
"""

from .base import Tokenizer, get_stats, merge


class BasicTokenizer(Tokenizer):

    def __init__(self):
        super().__init__()

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        num_merges = vocab_size - 256

        # input text preprocessing
        text_bytes = text.encode("utf-8") # raw bytes
        ids = list(text_bytes) # list of integers in range 0..255

        # iteratively merge the most common pairs to create new tokens
        merges = {} # (int, int) -> int
        vocab = {idx: bytes([idx]) for idx in range(256)} # int -> bytes
        for i in range(num_merges):
            # count up the number of times every consecutive pair appears
            stats = get_stats(ids)
            # find the pair with the highest count
            pair = max(stats, key=stats.get)
            # mint a new token: assign it the next available id
            idx = 256 + i
            # replace all occurrences of pair in ids with idx
            ids = merge(ids, pair, idx)
            # save the merge
            merges[pair] = idx
            vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
            # prints
            if verbose:
                print(f"merge {i+1}/{num_merges}: {pair} -> {idx} ({vocab[idx]}) had {stats[pair]} occurrences")

        # save class variables
        self.merges = merges # used in encode()
        self.vocab = vocab # used in decode()

    def decode(self, ids):
        # given ids (list of integers), return Python string
        text_bytes = b"".join(self.vocab[idx] for idx in ids)
        text = text_bytes.decode("utf-8", errors="replace")
        return text

    def encode(self, text):
        # given a string text, return the token ids
        text_bytes = text.encode("utf-8") # raw bytes
        ids = list(text_bytes) # list of integers in range 0..255
        while len(ids) >= 2:
            # find the pair with the lowest merge index
            stats = get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            # subtle: if there are no more merges available, the key will
            # result in an inf for every single pair, and the min will be
            # just the first pair in the list, arbitrarily
            # we can detect this terminating case by a membership check
            if pair not in self.merges:
                break # nothing else can be merged anymore
            # otherwise let's merge the best pair (lowest merge index)
            idx = self.merges[pair]
            ids = merge(ids, pair, idx)
        return ids
minbpe/gpt4.py ADDED
@@ -0,0 +1,130 @@
"""
Implements the GPT-4 Tokenizer as a light wrapper around the RegexTokenizer.
Note that this is a pretrained tokenizer. By default and inside init(), it
loads the pretrained tokenizer from the `cl100k_base` tokenizer of tiktoken.
"""

import tiktoken
from .regex import RegexTokenizer


def bpe(mergeable_ranks, token, max_rank):
    # helper function used in get_gpt4_merges() to reconstruct the merge forest
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        assert min_idx is not None
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts


def recover_merges(mergeable_ranks):
    # the `merges` are already the byte sequences in their merged state.
    # so we have to recover the original pairings. We can do this by doing
    # a small BPE training run on all the tokens, in their order.
    # also see https://github.com/openai/tiktoken/issues/60
    # also see https://github.com/karpathy/minbpe/issues/11#issuecomment-1950805306
    merges = {}
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue # skip raw bytes
        pair = tuple(bpe(mergeable_ranks, token, max_rank=rank))
        assert len(pair) == 2
        # recover the integer ranks of the pair
        ix0 = mergeable_ranks[pair[0]]
        ix1 = mergeable_ranks[pair[1]]
        merges[(ix0, ix1)] = rank

    return merges

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
GPT4_SPECIAL_TOKENS = {
    '<|endoftext|>': 100257,
    '<|fim_prefix|>': 100258,
    '<|fim_middle|>': 100259,
    '<|fim_suffix|>': 100260,
    '<|endofprompt|>': 100276
}

class GPT4Tokenizer(RegexTokenizer):
    """Lightweight wrapper on RegexTokenizer that matches GPT-4's tokenizer."""

    def __init__(self):
        super().__init__(pattern=GPT4_SPLIT_PATTERN)
        # get the official tokenizer and its merges
        enc = tiktoken.get_encoding("cl100k_base")
        mergeable_ranks = enc._mergeable_ranks
        # the merges are those of gpt4, but we have to recover them
        self.merges = recover_merges(mergeable_ranks)
        # reconstruct the vocab from the merges
        vocab = {idx: bytes([idx]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        self.vocab = vocab
        # now here is another tricky part.
        # for some reason, the tokens corresponding to individual bytes
        # are permuted in a different order. This is completely non-sensical
        # and probably historical, but therefore we have to deal with it here.
        self.byte_shuffle = {i: mergeable_ranks[bytes([i])] for i in range(256)}
        self.inverse_byte_shuffle = {v: k for k, v in self.byte_shuffle.items()}
        # finally register the special tokens
        self.register_special_tokens(GPT4_SPECIAL_TOKENS)

    def _encode_chunk(self, text_bytes):
        # before we start processing bytes, we have to permute them
        text_bytes = bytes(self.byte_shuffle[b] for b in text_bytes)
        ids = super()._encode_chunk(text_bytes)
        return ids

    def decode(self, ids):
        # we have to un-permute the bytes before we decode
        text_bytes = b"".join(self.vocab[idx] for idx in ids)
        text_bytes = bytes(self.inverse_byte_shuffle[b] for b in text_bytes)
        text = text_bytes.decode("utf-8", errors="replace")
        return text

    # this is a pretrained tokenizer, it is not intended to be trained
    def train(self, text, vocab_size, verbose=False):
        raise NotImplementedError

    # save/load would require some thought.
    # we'd have to change save/load of base to add support for byte_shuffle...
    # alternatively, we could move byte_shuffle to base class, but that would
    # mean that we're making ugly our beautiful Tokenizer just to support
    # the GPT-4 tokenizer and its weird historical quirks around byte_shuffle.
    def save(self, file_prefix):
        raise NotImplementedError("GPT4Tokenizer cannot be saved.")

    def load(self, model_file):
        raise NotImplementedError("GPT4Tokenizer cannot be loaded.")

    def save_vocab(self, vocab_file):
        # just for visualization purposes let's output the GPT-4 tokens
        # in the exact same format as the base class would.
        # simply run as:
        # python -c "from minbpe import GPT4Tokenizer; GPT4Tokenizer().save_vocab('gpt4.vocab')"
        from .base import render_token
        # build vocab being mindful of the byte shuffle
        vocab = {idx: bytes([self.inverse_byte_shuffle[idx]]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        # now merge the shuffled bytes and write to file
        inverted_merges = {idx: pair for pair, idx in self.merges.items()}
        with open(vocab_file, "w", encoding="utf-8") as f:
            for idx, token in vocab.items():
                s = render_token(token)
                if idx in inverted_merges:
                    idx0, idx1 = inverted_merges[idx]
                    s0 = render_token(vocab[idx0])
                    s1 = render_token(vocab[idx1])
                    f.write(f"[{s0}][{s1}] -> [{s}] {idx}\n")
                else:
                    f.write(f"[{s}] {idx}\n")
minbpe/regex.py ADDED
@@ -0,0 +1,164 @@
"""
Minimal (byte-level) Byte Pair Encoding tokenizer.

Algorithmically follows along the GPT tokenizer:
https://github.com/openai/gpt-2/blob/master/src/encoder.py

Unlike BasicTokenizer:
- RegexTokenizer handles an optional regex splitting pattern.
- RegexTokenizer handles optional special tokens.
"""

import regex as re
from .base import Tokenizer, get_stats, merge


# the main GPT text split patterns, see
# https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py
GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""


class RegexTokenizer(Tokenizer):

    def __init__(self, pattern=None):
        """
        - pattern: optional string to override the default (GPT-4 split pattern)
        - special_tokens: str -> int dictionary of special tokens
          example: {'<|endoftext|>': 100257}
        """
        super().__init__()
        self.pattern = GPT4_SPLIT_PATTERN if pattern is None else pattern
        self.compiled_pattern = re.compile(self.pattern)
        self.special_tokens = {}
        self.inverse_special_tokens = {}

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        num_merges = vocab_size - 256

        # split the text up into text chunks
        text_chunks = re.findall(self.compiled_pattern, text)

        # input text preprocessing
        ids = [list(ch.encode("utf-8")) for ch in text_chunks]

        # iteratively merge the most common pairs to create new tokens
        merges = {} # (int, int) -> int
        vocab = {idx: bytes([idx]) for idx in range(256)} # idx -> bytes
        for i in range(num_merges):
            # count the number of times every consecutive pair appears
            stats = {}
            for chunk_ids in ids:
                # passing in stats will update it in place, adding up counts
                get_stats(chunk_ids, stats)
            # find the pair with the highest count
            pair = max(stats, key=stats.get)
            # mint a new token: assign it the next available id
            idx = 256 + i
            # replace all occurrences of pair in ids with idx
            ids = [merge(chunk_ids, pair, idx) for chunk_ids in ids]
            # save the merge
            merges[pair] = idx
            vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
            # prints
            if verbose:
                print(f"merge {i+1}/{num_merges}: {pair} -> {idx} ({vocab[idx]}) had {stats[pair]} occurrences")

        # save class variables
        self.merges = merges # used in encode()
        self.vocab = vocab # used in decode()

    def register_special_tokens(self, special_tokens):
        # special_tokens is a dictionary of str -> int
        # example: {"<|endoftext|>": 100257}
        self.special_tokens = special_tokens
        self.inverse_special_tokens = {v: k for k, v in special_tokens.items()}

    def decode(self, ids):
        # given ids (list of integers), return Python string
        part_bytes = []
        for idx in ids:
            if idx in self.vocab:
                part_bytes.append(self.vocab[idx])
            elif idx in self.inverse_special_tokens:
                part_bytes.append(self.inverse_special_tokens[idx].encode("utf-8"))
            else:
                raise ValueError(f"invalid token id: {idx}")
        text_bytes = b"".join(part_bytes)
        text = text_bytes.decode("utf-8", errors="replace")
        return text

    def _encode_chunk(self, text_bytes):
        # return the token ids
        # let's begin. first, convert all bytes to integers in range 0..255
        ids = list(text_bytes)
        while len(ids) >= 2:
            # find the pair with the lowest merge index
            stats = get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            # subtle: if there are no more merges available, the key will
            # result in an inf for every single pair, and the min will be
            # just the first pair in the list, arbitrarily
            # we can detect this terminating case by a membership check
            if pair not in self.merges:
                break # nothing else can be merged anymore
            # otherwise let's merge the best pair (lowest merge index)
            idx = self.merges[pair]
            ids = merge(ids, pair, idx)
        return ids

    def encode_ordinary(self, text):
        """Encoding that ignores any special tokens."""
        # split text into chunks of text by categories defined in regex pattern
        text_chunks = re.findall(self.compiled_pattern, text)
        # all chunks of text are encoded separately, then results are joined
        ids = []
        for chunk in text_chunks:
            chunk_bytes = chunk.encode("utf-8") # raw bytes
            chunk_ids = self._encode_chunk(chunk_bytes)
            ids.extend(chunk_ids)
        return ids

    def encode(self, text, allowed_special="none_raise"):
        """
        Unlike encode_ordinary, this function handles special tokens.
        allowed_special: can be "all"|"none"|"none_raise" or a custom set of special tokens
        if none_raise, then an error is raised if any special token is encountered in text
        this is the default tiktoken behavior right now as well
        any other behavior is either annoying, or a major footgun
        """
        # decode the user desire w.r.t. handling of special tokens
        special = None
        if allowed_special == "all":
            special = self.special_tokens
        elif allowed_special == "none":
            special = {}
        elif allowed_special == "none_raise":
            special = {}
            assert all(token not in text for token in self.special_tokens)
        elif isinstance(allowed_special, set):
            special = {k: v for k, v in self.special_tokens.items() if k in allowed_special}
        else:
            raise ValueError(f"allowed_special={allowed_special} not understood")
        if not special:
            # shortcut: if no special tokens, just use the ordinary encoding
            return self.encode_ordinary(text)
        # otherwise, we have to be careful with potential special tokens in text
        # we handle special tokens by splitting the text
        # based on the occurrence of any exact match with any of the special tokens
        # we can use re.split for this. note that surrounding the pattern with ()
        # makes it into a capturing group, so the special tokens will be included
        special_pattern = "(" + "|".join(re.escape(k) for k in special) + ")"
        special_chunks = re.split(special_pattern, text)
        # now all the special characters are separated from the rest of the text
        # all chunks of text are encoded separately, then results are joined
        ids = []
        for part in special_chunks:
            if part in special:
                # this is a special token, encode it separately as a special case
                ids.append(special[part])
            else:
                # this is an ordinary sequence, encode it normally
                ids.extend(self.encode_ordinary(part))
        return ids
requirements.txt ADDED
@@ -0,0 +1,2 @@
regex
tiktoken
tests/__init__.py ADDED
File without changes
tests/taylorswift.txt ADDED
The diff for this file is too large to render. See raw diff
 
tests/test_tokenizer.py ADDED
@@ -0,0 +1,135 @@
1
+ import pytest
2
+ import tiktoken
3
+ import os
4
+
5
+ from minbpe import BasicTokenizer, RegexTokenizer, GPT4Tokenizer
6
+
7
+ # -----------------------------------------------------------------------------
8
+ # common test data
9
+
10
+ # a few strings to test the tokenizers on
11
+ test_strings = [
12
+ "", # empty string
13
+ "?", # single character
14
+ "hello world!!!? (안녕하세요!) lol123 😉", # fun small string
15
+ "FILE:taylorswift.txt", # FILE: is handled as a special string in unpack()
16
+ ]
17
+ def unpack(text):
18
+ # we do this because `pytest -v .` prints the arguments to console, and we don't
19
+ # want to print the entire contents of the file, which creates a mess. So here we go.
20
+ if text.startswith("FILE:"):
21
+ dirname = os.path.dirname(os.path.abspath(__file__))
22
+ taylorswift_file = os.path.join(dirname, text[5:])
23
+ contents = open(taylorswift_file, "r", encoding="utf-8").read()
24
+ return contents
25
+ else:
26
+ return text
27
+
28
+ specials_string = """
29
+ <|endoftext|>Hello world this is one document
30
+ <|endoftext|>And this is another document
31
+ <|endoftext|><|fim_prefix|>And this one has<|fim_suffix|> tokens.<|fim_middle|> FIM
32
+ <|endoftext|>Last document!!! 👋<|endofprompt|>
33
+ """.strip()
34
+ special_tokens = {
35
+ '<|endoftext|>': 100257,
36
+ '<|fim_prefix|>': 100258,
37
+ '<|fim_middle|>': 100259,
38
+ '<|fim_suffix|>': 100260,
39
+ '<|endofprompt|>': 100276
40
+ }
41
+ llama_text = """
42
+ <|endoftext|>The llama (/ˈlɑːmə/; Spanish pronunciation: [ˈʎama] or [ˈʝama]) (Lama glama) is a domesticated South American camelid, widely used as a meat and pack animal by Andean cultures since the pre-Columbian era.
43
+ Llamas are social animals and live with others as a herd. Their wool is soft and contains only a small amount of lanolin.[2] Llamas can learn simple tasks after a few repetitions. When using a pack, they can carry about 25 to 30% of their body weight for 8 to 13 km (5–8 miles).[3] The name llama (in the past also spelled "lama" or "glama") was adopted by European settlers from native Peruvians.[4]
44
+ The ancestors of llamas are thought to have originated from the Great Plains of North America about 40 million years ago, and subsequently migrated to South America about three million years ago during the Great American Interchange. By the end of the last ice age (10,000–12,000 years ago), camelids were extinct in North America.[3] As of 2007, there were over seven million llamas and alpacas in South America and over 158,000 llamas and 100,000 alpacas, descended from progenitors imported late in the 20th century, in the United States and Canada.[5]
45
+ <|fim_prefix|>In Aymara mythology, llamas are important beings. The Heavenly Llama is said to drink water from the ocean and urinates as it rains.[6] According to Aymara eschatology,<|fim_suffix|> where they come from at the end of time.[6]<|fim_middle|> llamas will return to the water springs and ponds<|endofprompt|>
46
+ """.strip()
47
+
48
+ # -----------------------------------------------------------------------------
49
+ # tests
50
+
51
+ # test encode/decode identity for a few different strings
52
+ @pytest.mark.parametrize("tokenizer_factory", [BasicTokenizer, RegexTokenizer, GPT4Tokenizer])
53
+ @pytest.mark.parametrize("text", test_strings)
54
+ def test_encode_decode_identity(tokenizer_factory, text):
55
+ text = unpack(text)
56
+ tokenizer = tokenizer_factory()
57
+ ids = tokenizer.encode(text)
58
+ decoded = tokenizer.decode(ids)
59
+ assert text == decoded
60
+
61
+ # test that our tokenizer matches the official GPT-4 tokenizer
62
+ @pytest.mark.parametrize("text", test_strings)
63
+ def test_gpt4_tiktoken_equality(text):
64
+ text = unpack(text)
65
+ tokenizer = GPT4Tokenizer()
66
+ enc = tiktoken.get_encoding("cl100k_base")
67
+ tiktoken_ids = enc.encode(text)
68
+ gpt4_tokenizer_ids = tokenizer.encode(text)
69
+ assert gpt4_tokenizer_ids == tiktoken_ids
70
+
71
+ # test the handling of special tokens
72
+ def test_gpt4_tiktoken_equality_special_tokens():
73
+ tokenizer = GPT4Tokenizer()
74
+ enc = tiktoken.get_encoding("cl100k_base")
75
+ tiktoken_ids = enc.encode(specials_string, allowed_special="all")
76
+ gpt4_tokenizer_ids = tokenizer.encode(specials_string, allowed_special="all")
77
+ assert gpt4_tokenizer_ids == tiktoken_ids
78
+
79
+ # reference test to add more tests in the future
80
+ @pytest.mark.parametrize("tokenizer_factory", [BasicTokenizer, RegexTokenizer])
81
+ def test_wikipedia_example(tokenizer_factory):
82
+ """
83
+ Quick unit test, following along the Wikipedia example:
84
+ https://en.wikipedia.org/wiki/Byte_pair_encoding
85
+
86
+ According to Wikipedia, running bpe on the input string:
87
+ "aaabdaaabac"
88
+
89
+ for 3 merges will result in string:
90
+ "XdXac"
91
+
92
+ where:
93
+ X=ZY
94
+ Y=ab
95
+ Z=aa
96
+
97
+ Keep in mind that for us a=97, b=98, c=99, d=100 (ASCII values)
98
+ so Z will be 256, Y will be 257, X will be 258.
99
+
100
+ So we expect the output list of ids to be [258, 100, 258, 97, 99]
101
+ """
102
+ tokenizer = tokenizer_factory()
103
+ text = "aaabdaaabac"
104
+ tokenizer.train(text, 256 + 3)
105
+ ids = tokenizer.encode(text)
106
+ assert ids == [258, 100, 258, 97, 99]
107
+ assert tokenizer.decode(tokenizer.encode(text)) == text
108
+
109
+ @pytest.mark.parametrize("special_tokens", [{}, special_tokens])
110
+ def test_save_load(special_tokens):
111
+ # take a slightly more complex piece of text, chosen at random, and train the tokenizer
112
+ text = llama_text
113
+ # create a Tokenizer and do 64 merges
114
+ tokenizer = RegexTokenizer()
115
+ tokenizer.train(text, 256 + 64)
116
+ tokenizer.register_special_tokens(special_tokens)
117
+ # verify that decode(encode(x)) == x
118
+ assert tokenizer.decode(tokenizer.encode(text, "all")) == text
119
+ # verify that save/load work as expected
120
+ ids = tokenizer.encode(text, "all")
121
+ # save the tokenizer (TODO use a proper temporary directory)
122
+ tokenizer.save("test_tokenizer_tmp")
123
+ # re-load the tokenizer
124
+ tokenizer = RegexTokenizer()
125
+ tokenizer.load("test_tokenizer_tmp.model")
126
+ # verify that decode(encode(x)) == x
127
+ assert tokenizer.decode(ids) == text
128
+ assert tokenizer.decode(tokenizer.encode(text, "all")) == text
129
+ assert tokenizer.encode(text, "all") == ids
130
+ # delete the temporary files
131
+ for file in ["test_tokenizer_tmp.model", "test_tokenizer_tmp.vocab"]:
132
+ os.remove(file)
133
+
134
+ if __name__ == "__main__":
135
+ pytest.main()
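For readers who want to watch the expected ids in `test_wikipedia_example` emerge without constructing a tokenizer, here is a small self-contained sketch of the same greedy merge loop (the `get_stats`/`merge` helpers mirror the ones in the notebook below; the real `Tokenizer` classes implement the equivalent logic):

```python
def get_stats(ids):
    # count how often each adjacent pair occurs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` with the new token `idx`
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

ids = list("aaabdaaabac".encode("utf-8"))   # [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
for step in range(3):                       # three merges create tokens 256, 257, 258
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    ids = merge(ids, pair, 256 + step)
print(ids)                                  # [258, 100, 258, 97, 99], matching the test's assertion
```

The intermediate merge order can differ from the Wikipedia write-up when two pairs tie in count, but after three merges the final sequence is the same.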
tokenize.ipynb ADDED
@@ -0,0 +1,128 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "file_path = \"/Users/mohammad.ibrahim/Desktop/TSAI/combined_text.txt\"\n",
10
+ "with open(file_path, 'r', encoding='utf-8') as file:\n",
11
+ " text = file.read()"
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "code",
16
+ "execution_count": 5,
17
+ "metadata": {},
18
+ "outputs": [],
19
+ "source": [
20
+ "pattern = r\"\"\"'(?i:[sdmt]|ll|ve|re)|[^\\r\\n\\p{L}\\p{N}।•]?+\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}।•]++[\\r\\n]*|\\s*[\\r\\n]|\\s+(?!\\S)|\\s+|।|•\"\"\"\n"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": 6,
26
+ "metadata": {},
27
+ "outputs": [],
28
+ "source": [
29
+ "import regex as re\n",
30
+ "text_chunks = re.findall(pattern, text)\n",
31
+ "\n",
32
+ " # input text preprocessing\n",
33
+ "tokens = [list(ch.encode(\"utf-8\")) for ch in text_chunks]"
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "code",
38
+ "execution_count": 2,
39
+ "metadata": {},
40
+ "outputs": [],
41
+ "source": [
42
+ "# tokens = text.encode(\"utf-8\") # raw bytes\n",
43
+ "tokens = [b for chunk in tokens for b in chunk] # flatten the per-chunk byte lists into one list of ints in range 0..255"
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "code",
48
+ "execution_count": 3,
49
+ "metadata": {},
50
+ "outputs": [
51
+ {
52
+ "name": "stdout",
53
+ "output_type": "stream",
54
+ "text": [
55
+ "tokens length: 179910393\n",
56
+ "ids length: 32798069\n",
57
+ "compression ratio: 5.49X\n"
58
+ ]
59
+ }
60
+ ],
61
+ "source": [
62
+ "def get_stats(ids):\n",
63
+ " counts = {}\n",
64
+ " for pair in zip(ids, ids[1:]):\n",
65
+ " counts[pair] = counts.get(pair, 0) + 1\n",
66
+ " return counts\n",
67
+ "\n",
68
+ "def merge(ids, pair, idx):\n",
69
+ " newids = []\n",
70
+ " i = 0\n",
71
+ " while i < len(ids):\n",
72
+ " if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:\n",
73
+ " newids.append(idx)\n",
74
+ " i += 2\n",
75
+ " else:\n",
76
+ " newids.append(ids[i])\n",
77
+ " i += 1\n",
78
+ " return newids\n",
79
+ "\n",
80
+ "# ---\n",
81
+ "vocab_size = 1000 # the desired final vocabulary size\n",
82
+ "num_merges = vocab_size - 256\n",
83
+ "ids = list(tokens) # copy so we don't destroy the original list\n",
84
+ "\n",
85
+ "merges = {} # (int, int) -> int\n",
86
+ "for i in range(num_merges):\n",
87
+ " stats = get_stats(ids)\n",
88
+ " pair = max(stats, key=stats.get)\n",
89
+ " idx = 256 + i\n",
90
+ " # print(f\"merging {pair} into a new token {idx}\")\n",
91
+ " ids = merge(ids, pair, idx)\n",
92
+ " merges[pair] = idx\n",
93
+ "\n",
94
+ "print(\"tokens length:\", len(tokens))\n",
95
+ "print(\"ids length:\", len(ids))\n",
96
+ "print(f\"compression ratio: {len(tokens) / len(ids):.2f}X\")"
97
+ ]
98
+ },
99
+ {
100
+ "cell_type": "code",
101
+ "execution_count": null,
102
+ "metadata": {},
103
+ "outputs": [],
104
+ "source": []
105
+ }
106
+ ],
107
+ "metadata": {
108
+ "kernelspec": {
109
+ "display_name": "Python 3",
110
+ "language": "python",
111
+ "name": "python3"
112
+ },
113
+ "language_info": {
114
+ "codemirror_mode": {
115
+ "name": "ipython",
116
+ "version": 3
117
+ },
118
+ "file_extension": ".py",
119
+ "mimetype": "text/x-python",
120
+ "name": "python",
121
+ "nbconvert_exporter": "python",
122
+ "pygments_lexer": "ipython3",
123
+ "version": "3.9.6"
124
+ }
125
+ },
126
+ "nbformat": 4,
127
+ "nbformat_minor": 2
128
+ }
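The notebook above stops at the compression ratio; as a hedged follow-up, the learned `merges` dict can be turned back into a byte-level vocabulary for decoding. The snippet assumes the notebook's `merges`, `ids`, and `text` variables are in scope and mirrors how the tokenizers in this repo decode; whether the round trip is exact depends on the split pattern covering every character of the input.

```python
# rebuild the vocabulary: base bytes plus one entry per learned merge
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():       # merges were inserted in creation order,
    vocab[idx] = vocab[p0] + vocab[p1]     # so both children already exist here

def decode(ids):
    text_bytes = b"".join(vocab[i] for i in ids)
    return text_bytes.decode("utf-8", errors="replace")

print(decode(ids)[:200])                   # should reproduce the start of the original text
```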
train.py ADDED
@@ -0,0 +1,27 @@
1
+ """
2
+ Train our Tokenizers on some data, just to see them in action.
3
+ The whole thing runs in ~25 seconds on my laptop.
4
+ """
5
+
6
+ import os
7
+ import time
8
+ from minbpe import BasicTokenizer, RegexTokenizer
9
+
10
+ # open some text and train a vocab of 512 tokens
11
+ text = open("tests/taylorswift.txt", "r", encoding="utf-8").read()
12
+
13
+ # create a directory for models, so we don't pollute the current directory
14
+ os.makedirs("models", exist_ok=True)
15
+
16
+ t0 = time.time()
17
+ for TokenizerClass, name in zip([BasicTokenizer, RegexTokenizer], ["basic", "regex"]):
18
+
19
+ # construct the Tokenizer object and kick off verbose training
20
+ tokenizer = TokenizerClass()
21
+ tokenizer.train(text, 512, verbose=True)
22
+ # writes two files in the models directory: name.model, and name.vocab
23
+ prefix = os.path.join("models", name)
24
+ tokenizer.save(prefix)
25
+ t1 = time.time()
26
+
27
+ print(f"Training took {t1 - t0:.2f} seconds")
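Once train.py has run, the saved models can be loaded back and used directly. A short usage sketch, assuming the `.model` files were written to the `models` directory as above (the `load` call follows the pattern in `test_save_load`):

```python
from minbpe import RegexTokenizer

tokenizer = RegexTokenizer()
tokenizer.load("models/regex.model")       # reads the model file written by train.py

ids = tokenizer.encode("hello world!!!? (안녕하세요!) lol123 😉")
print(ids)
print(tokenizer.decode(ids))               # should round-trip back to the input string
```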