Upload 20 files
- LICENSE +21 -0
- README.md +150 -12
- app.py +65 -0
- assets/tiktokenizer.png +0 -0
- exercise.md +55 -0
- first.model +3 -0
- first.vocab +261 -0
- lecture.md +107 -0
- minbep.py +23 -0
- minbpe/__init__.py +4 -0
- minbpe/base.py +165 -0
- minbpe/basic.py +74 -0
- minbpe/gpt4.py +130 -0
- minbpe/regex.py +164 -0
- requirements.txt +2 -0
- tests/__init__.py +0 -0
- tests/taylorswift.txt +0 -0
- tests/test_tokenizer.py +135 -0
- tokenize.ipynb +128 -0
- train.py +27 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Andrej

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
CHANGED
@@ -1,12 +1,150 @@
# minbpe

Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.

This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https://github.com/openai/gpt-2) from OpenAI. [Sennrich et al. 2015](https://arxiv.org/abs/1508.07909) is cited as the original reference for the use of BPE in NLP applications. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.

There are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The files of the repo are as follows:

1. [minbpe/base.py](minbpe/base.py): Implements the `Tokenizer` class, which is the base class. It contains the `train`, `encode`, and `decode` stubs, save/load functionality, and a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.
2. [minbpe/basic.py](minbpe/basic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.
3. [minbpe/regex.py](minbpe/regex.py): Implements the `RegexTokenizer` that further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4. This class also handles special tokens, if any.
4. [minbpe/gpt4.py](minbpe/gpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (3, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and the handling of some unfortunate (and likely historical?) 1-byte token permutations.

Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [tests/taylorswift.txt](tests/taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.

All of the files above are very short and thoroughly commented, and also contain a usage example at the bottom of the file.

## quick start

As the simplest example, we can reproduce the [Wikipedia article on BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) as follows:

```python
from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
print(tokenizer.encode(text))
# [258, 100, 258, 97, 99]
print(tokenizer.decode([258, 100, 258, 97, 99]))
# aaabdaaabac
tokenizer.save("toy")
# writes two files: toy.model (for loading) and toy.vocab (for viewing)
```

According to Wikipedia, running BPE on the input string "aaabdaaabac" for 3 merges results in the string "XdXac" where X=ZY, Y=ab, and Z=aa. The tricky thing to note is that minbpe always allocates the 256 individual bytes as tokens, and then merges bytes as needed from there. So for us a=97, b=98, c=99, d=100 (their [ASCII](https://www.asciitable.com) values). Then when (a,a) is merged to Z, Z will become 256. Likewise Y will become 257 and X 258. So we start with the 256 bytes, and do 3 merges to get to the result above, with the expected output of [258, 100, 258, 97, 99].
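
If you want to watch those three merges happen, a small optional sketch (assuming the `minbpe` package in this repo is importable from your working directory) is to re-run the toy example with `verbose=True`:

```python
# a small sketch: re-run the toy example with verbose printing so each of the
# three merges is shown as it gets minted into token ids 256, 257 and 258
from minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
tokenizer.train("aaabdaaabac", 256 + 3, verbose=True)
print(tokenizer.merges)  # maps each merged pair of ids to its new token id
```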

## inference: GPT-4 comparison

We can verify that the `RegexTokenizer` has feature parity with the GPT-4 tokenizer from [tiktoken](https://github.com/openai/tiktoken) as follows:

```python
text = "hello123!!!? (안녕하세요!) 😉"

# tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]

# ours
from minbpe import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
print(tokenizer.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
```

(you'll have to `pip install tiktoken` to run). Under the hood, the `GPT4Tokenizer` is just a light wrapper around `RegexTokenizer`, passing in the merges and the special tokens of GPT-4. We can also ensure the special tokens are handled correctly:

```python
text = "<|endoftext|>hello world"

# tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode(text, allowed_special="all"))
# [100257, 15339, 1917]

# ours
from minbpe import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
print(tokenizer.encode(text, allowed_special="all"))
# [100257, 15339, 1917]
```

Note that just like tiktoken, we have to explicitly declare our intent to use and parse special tokens in the call to encode. Otherwise this can become a major footgun, unintentionally tokenizing attacker-controlled data (e.g. user prompts) with special tokens. The `allowed_special` parameter can be set to "all", "none", or a list of special tokens to allow.

## training

Unlike tiktoken, this code allows you to train your own tokenizer. In principle and to my knowledge, if you train the `RegexTokenizer` on a large dataset with a vocabulary size of 100K, you would reproduce the GPT-4 tokenizer.

There are two paths you can follow. First, you can decide that you don't want the complexity of splitting and preprocessing text with regex patterns, and you also don't care for special tokens. In that case, reach for the `BasicTokenizer`. You can train it, and then encode and decode for example as follows:

```python
from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
tokenizer.train(very_long_training_string, vocab_size=4096)
tokenizer.encode("hello world") # string -> tokens
tokenizer.decode([1000, 2000, 3000]) # tokens -> string
tokenizer.save("mymodel") # writes mymodel.model and mymodel.vocab
tokenizer.load("mymodel.model") # loads the model back, the vocab is just for vis
```

If you instead want to follow along with what OpenAI did for their text tokenizer, it's a good idea to adopt their approach of using a regex pattern to split the text by categories. The GPT-4 pattern is the default with the `RegexTokenizer`, so you'd simply do something like:

```python
from minbpe import RegexTokenizer
tokenizer = RegexTokenizer()
tokenizer.train(very_long_training_string, vocab_size=32768)
tokenizer.encode("hello world") # string -> tokens
tokenizer.decode([1000, 2000, 3000]) # tokens -> string
tokenizer.save("tok32k") # writes tok32k.model and tok32k.vocab
tokenizer.load("tok32k.model") # loads the model back from disk
```

Where, of course, you'd want to change around the vocabulary size depending on the size of your dataset.

**Special tokens**. Finally, you might wish to add special tokens to your tokenizer. Register these using the `register_special_tokens` function. For example if you train with vocab_size of 32768, then the first 256 tokens are raw byte tokens, the next 32768-256 are merge tokens, and after those you can add the special tokens. The last "real" merge token will have id of 32767 (vocab_size - 1), so your first special token should come right after that, with an id of exactly 32768. So:

```python
from minbpe import RegexTokenizer
tokenizer = RegexTokenizer()
tokenizer.train(very_long_training_string, vocab_size=32768)
tokenizer.register_special_tokens({"<|endoftext|>": 32768})
tokenizer.encode("<|endoftext|>hello world", allowed_special="all")
```

You can of course add more tokens after that as well, as you like. Finally, I'd like to stress that I tried hard to keep the code itself clean, readable and hackable. You should not feel scared to read the code and understand how it works. The tests are also a nice place to look for more usage examples. That reminds me:

## tests

We use the pytest library for tests. All of them are located in the `tests/` directory. First `pip install pytest` if you haven't already, then:

```bash
$ pytest -v .
```

to run the tests. (-v is verbose, slightly prettier).

## community extensions

* [gnp/minbpe-rs](https://github.com/gnp/minbpe-rs): A Rust implementation of `minbpe` providing (near) one-to-one correspondence with the Python version

## exercise

For those trying to study BPE, here is the advised exercise progression for building your own minbpe step by step. See [exercise.md](exercise.md).

## lecture

I built the code in this repository in this [YouTube video](https://www.youtube.com/watch?v=zduSFxRajkE). You can also find this lecture in text form in [lecture.md](lecture.md).

## todos

- write a more optimized Python version that could run over large files and big vocabs
- write an even more optimized C or Rust version (think through)
- rename GPT4Tokenizer to GPTTokenizer and support GPT-2/GPT-3/GPT-3.5 as well?
- write a LlamaTokenizer similar to GPT4Tokenizer (i.e. attempt sentencepiece equivalent)

## License

MIT
app.py
ADDED
@@ -0,0 +1,65 @@
# from minbpe import BasicTokenizer, RegexTokenizer
# tokenizer = RegexTokenizer()
# tokenizer.load("first.model")

# text_to_encode = "मुझसे क्या होगा अब"
# encoded_text = tokenizer.encode(text_to_encode)
# print("Encoded:", encoded_text) # Output: [258, 100, 258, 97, 99]

# # Print the tokenized text
# print("Tokenized Text:", encoded_text)

# # Decode text
# decoded_text = tokenizer.decode(encoded_text)
# print("Decoded:", decoded_text) # Output: "aaabdaaabac"

import gradio as gr
from minbpe import BasicTokenizer, RegexTokenizer

# Initialize the tokenizer
tokenizer = RegexTokenizer()
tokenizer.load("first.model")

# Define the encoding function
def encode_text(text):
    encoded_text = tokenizer.encode(text)
    return str(encoded_text)

# Define the decoding function
def decode_text(encoded_text):
    encoded_list = list(map(int, encoded_text.strip('[]').split(',')))
    decoded_text = tokenizer.decode(encoded_list)
    return decoded_text

# Define the Gradio interface
def gradio_app():
    with gr.Blocks() as demo:
        gr.Markdown("# Text Encoder and Decoder")

        with gr.Row():
            with gr.Column():
                text_input = gr.Textbox(label="Text to Encode")
                encoded_output = gr.Textbox(label="Encoded Text", interactive=False)
                encode_button = gr.Button("Encode")

                def encode_handler(text):
                    return encode_text(text)

                encode_button.click(fn=encode_handler, inputs=text_input, outputs=encoded_output)

            with gr.Column():
                encoded_input = gr.Textbox(label="Encoded Text")
                decoded_output = gr.Textbox(label="Decoded Text", interactive=False)
                decode_button = gr.Button("Decode")

                def decode_handler(encoded_text):
                    return decode_text(encoded_text)

                decode_button.click(fn=decode_handler, inputs=encoded_input, outputs=decoded_output)

    return demo

# Launch the app
if __name__ == "__main__":
    app = gradio_app()
    app.launch()
assets/tiktokenizer.png
ADDED
exercise.md
ADDED
@@ -0,0 +1,55 @@
# exercise

Build your own GPT-4 Tokenizer!

### Step 1

Write the `BasicTokenizer` class, with the following three core functions:

- `def train(self, text, vocab_size, verbose=False)`
- `def encode(self, text)`
- `def decode(self, ids)`

Train your tokenizer on whatever text you like and visualize the merged tokens. Do they look reasonable? One default test you may wish to use is the text file `tests/taylorswift.txt`.
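
For reference, here is a minimal sketch of what `train` has to do, assuming you keep the 256 raw byte tokens and then perform `vocab_size - 256` merges (the repo's [minbpe/basic.py](minbpe/basic.py) is the full version):

```python
# a minimal sketch of the BPE training loop over raw UTF-8 bytes
def train_sketch(text, vocab_size):
    ids = list(text.encode("utf-8"))   # raw bytes as ints in 0..255
    merges = {}                        # (int, int) -> new token id
    for i in range(vocab_size - 256):
        # count consecutive pairs
        counts = {}
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        pair = max(counts, key=counts.get)   # most frequent pair
        idx = 256 + i                        # mint a new token id
        # replace every occurrence of the pair with the new id
        new_ids, j = [], 0
        while j < len(ids):
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
                new_ids.append(idx)
                j += 2
            else:
                new_ids.append(ids[j])
                j += 1
        ids = new_ids
        merges[pair] = idx
    return merges

print(train_sketch("aaabdaaabac", 256 + 3))
```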

### Step 2

Convert your `BasicTokenizer` into a `RegexTokenizer`, which takes a regex pattern and splits the text exactly as GPT-4 would. Process the parts separately as before, then concatenate the results. Retrain your tokenizer and compare the results before and after. You should see that you will now have no tokens that go across categories (numbers, letters, punctuation, more than one whitespace). Use the GPT-4 pattern:

```
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
```
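
A sketch of the chunking step, assuming you use the `regex` package (the standard library `re` module does not support `\p{...}` classes or possessive quantifiers):

```python
import regex as re  # pip install regex

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

def split_and_byte_encode(text):
    # split the text into chunks first, then turn each chunk into byte ids;
    # pair counts and merges are later computed within chunks, never across them
    chunks = re.findall(GPT4_SPLIT_PATTERN, text)
    return [list(chunk.encode("utf-8")) for chunk in chunks]

print(split_and_byte_encode("hello world123!!"))
```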

### Step 3

You're now ready to load the merges from the GPT-4 tokenizer and show that your tokenizer produces the identical results for both `encode` and `decode`, matching [tiktoken](https://github.com/openai/tiktoken).

```
# match this
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # this is the GPT-4 tokenizer
ids = enc.encode("hello world!!!? (안녕하세요!) lol123 😉")
text = enc.decode(ids) # get the same text back
```

Unfortunately, you will run into two issues:

1. It is not trivial to recover the raw merges from the GPT-4 tokenizer. You can easily recover what we call `vocab` here, and what they call and store under `enc._mergeable_ranks`. Feel free to copy paste the `recover_merges` function in `minbpe/gpt4.py`, which takes these ranks and returns the raw merges. If you wish to know how this function works, read [this](https://github.com/openai/tiktoken/issues/60) and [this](https://github.com/karpathy/minbpe/issues/11#issuecomment-1950805306). Basically, under some conditions it is enough to only store the parent nodes (and their rank) and get rid of the precise details of which children merged up to any parent.
2. The GPT-4 tokenizer for some reason permutes its raw bytes. It stores this permutation in the first 256 elements of the mergeable ranks, so you can recover this byte shuffle relatively simply as `byte_shuffle = {i: enc._mergeable_ranks[bytes([i])] for i in range(256)}`. In both your encode and decode, you'll have to shuffle bytes around accordingly. If you're stuck, reference the `minbpe/gpt4.py` file for hints.

### Step 4

(Optional, irritating, not obviously useful) Add the ability to handle special tokens. You'll then be able to match the output of tiktoken even when special tokens are present, e.g.:

```
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # this is the GPT-4 tokenizer
ids = enc.encode("<|endoftext|>hello world", allowed_special="all")
```

Without `allowed_special` tiktoken will error.

### Step 5

If you've made it this far, you're now a pro at LLM Tokenization! Sadly, you're not exactly done yet because a lot of LLMs outside of OpenAI (e.g. Llama, Mistral) use [sentencepiece](https://github.com/google/sentencepiece) instead. The primary difference is that sentencepiece runs BPE directly on Unicode code points instead of on UTF-8 encoded bytes. Feel free to explore sentencepiece on your own (good luck, it's not too pretty), and as a stretch goal, if you really suffer from the burden of too much free time, re-write your BPE to run on Unicode code points and match the Llama 2 tokenizer.
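
As a concrete illustration of that difference, here is a tiny sketch comparing the two views of the same string:

```python
s = "안녕하세요"
print(list(s))                  # sentencepiece-style units: Unicode code points
print(list(s.encode("utf-8")))  # minbpe-style units: raw UTF-8 bytes (0..255)
```
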
first.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e2c490db19a6238afee8a6869e07688db901dcf5f0a30b4a06329c77124f93b
size 161
first.vocab
ADDED
@@ -0,0 +1,261 @@
[\u0000] 0
[\u0001] 1
[\u0002] 2
[\u0003] 3
[\u0004] 4
[\u0005] 5
[\u0006] 6
[\u0007] 7
[\u0008] 8
[\u0009] 9
[\u000a] 10
[\u000b] 11
[\u000c] 12
[\u000d] 13
[\u000e] 14
[\u000f] 15
[\u0010] 16
[\u0011] 17
[\u0012] 18
[\u0013] 19
[\u0014] 20
[\u0015] 21
[\u0016] 22
[\u0017] 23
[\u0018] 24
[\u0019] 25
[\u001a] 26
[\u001b] 27
[\u001c] 28
[\u001d] 29
[\u001e] 30
[\u001f] 31
[ ] 32
[!] 33
["] 34
[#] 35
[$] 36
[%] 37
[&] 38
['] 39
[(] 40
[)] 41
[*] 42
[+] 43
[,] 44
[-] 45
[.] 46
[/] 47
[0] 48
[1] 49
[2] 50
[3] 51
[4] 52
[5] 53
[6] 54
[7] 55
[8] 56
[9] 57
[:] 58
[;] 59
[<] 60
[=] 61
[>] 62
[?] 63
[@] 64
[A] 65
[B] 66
[C] 67
[D] 68
[E] 69
[F] 70
[G] 71
[H] 72
[I] 73
[J] 74
[K] 75
[L] 76
[M] 77
[N] 78
[O] 79
[P] 80
[Q] 81
[R] 82
[S] 83
[T] 84
[U] 85
[V] 86
[W] 87
[X] 88
[Y] 89
[Z] 90
[[] 91
[\] 92
[]] 93
[^] 94
[_] 95
[`] 96
[a] 97
[b] 98
[c] 99
[d] 100
[e] 101
[f] 102
[g] 103
[h] 104
[i] 105
[j] 106
[k] 107
[l] 108
[m] 109
[n] 110
[o] 111
[p] 112
[q] 113
[r] 114
[s] 115
[t] 116
[u] 117
[v] 118
[w] 119
[x] 120
[y] 121
[z] 122
[{] 123
[|] 124
[}] 125
[~] 126
[\u007f] 127
[�] 128
[�] 129
[�] 130
[�] 131
[�] 132
[�] 133
[�] 134
[�] 135
[�] 136
[�] 137
[�] 138
[�] 139
[�] 140
[�] 141
[�] 142
[�] 143
[�] 144
[�] 145
[�] 146
[�] 147
[�] 148
[�] 149
[�] 150
[�] 151
[�] 152
[�] 153
[�] 154
[�] 155
[�] 156
[�] 157
[�] 158
[�] 159
[�] 160
[�] 161
[�] 162
[�] 163
[�] 164
[�] 165
[�] 166
[�] 167
[�] 168
[�] 169
[�] 170
[�] 171
[�] 172
[�] 173
[�] 174
[�] 175
[�] 176
[�] 177
[�] 178
[�] 179
[�] 180
[�] 181
[�] 182
[�] 183
[�] 184
[�] 185
[�] 186
[�] 187
[�] 188
[�] 189
[�] 190
[�] 191
[�] 192
[�] 193
[�] 194
[�] 195
[�] 196
[�] 197
[�] 198
[�] 199
[�] 200
[�] 201
[�] 202
[�] 203
[�] 204
[�] 205
[�] 206
[�] 207
[�] 208
[�] 209
[�] 210
[�] 211
[�] 212
[�] 213
[�] 214
[�] 215
[�] 216
[�] 217
[�] 218
[�] 219
[�] 220
[�] 221
[�] 222
[�] 223
[�] 224
[�] 225
[�] 226
[�] 227
[�] 228
[�] 229
[�] 230
[�] 231
[�] 232
[�] 233
[�] 234
[�] 235
[�] 236
[�] 237
[�] 238
[�] 239
[�] 240
[�] 241
[�] 242
[�] 243
[�] 244
[�] 245
[�] 246
[�] 247
[�] 248
[�] 249
[�] 250
[�] 251
[�] 252
[�] 253
[�] 254
[�] 255
[�][�] -> [�] 256
[ ][�] -> [ �] 257
[�][�] -> [�] 258
[�][�] -> [ा] 259
[�][�] -> [र] 260
lecture.md
ADDED
@@ -0,0 +1,107 @@
# LLM Tokenization

Hi everyone, today we are going to look at Tokenization in Large Language Models (LLMs). Sadly, tokenization is a relatively complex and gnarly component of the state of the art LLMs, but it is necessary to understand in some detail because a lot of the shortcomings of LLMs that may be attributed to the neural network or otherwise appear mysterious actually trace back to tokenization.

### Previously: character-level tokenization

So what is tokenization? Well it turns out that in our previous video, [Let's build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY), we already covered tokenization but it was only a very simple, naive, character-level version of it. When you go to the [Google colab](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing) for that video, you'll see that we started with our training data ([Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)), which is just a large string in Python:

```
First Citizen: Before we proceed any further, hear me speak.

All: Speak, speak.

First Citizen: You are all resolved rather to die than to famish?

All: Resolved. resolved.

First Citizen: First, you know Caius Marcius is chief enemy to the people.

All: We know't, we know't.
```

But how do we feed strings into a language model? Well, we saw that we did this by first constructing a vocabulary of all the possible characters we found in the entire training set:

```python
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

# !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# 65
```

And then creating a lookup table for converting between individual characters and integers according to the vocabulary above. This lookup table was just a Python dictionary:

```python
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("hii there"))
print(decode(encode("hii there")))

# [46, 47, 47, 1, 58, 46, 43, 56, 43]
# hii there
```

Once we've converted a string into a sequence of integers, we saw that each integer was used as an index into a 2-dimensional embedding table of trainable parameters. Because we have a vocabulary size of `vocab_size=65`, this embedding table will also have 65 rows:

```python
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

    def forward(self, idx, targets=None):
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
```

Here, the integer "plucks out" a row of this embedding table and this row is the vector that represents this token. This vector then feeds into the Transformer as the input at the corresponding time step.

### "Character chunks" for tokenization using the BPE algorithm

This is all well and good for the naive setting of a character-level language model. But in practice, in state of the art language models, people use a lot more complicated schemes for constructing these token vocabularies. In particular, these schemes work not on a character level, but on a character chunk level. And the way these chunk vocabularies are constructed is by using algorithms such as the **Byte Pair Encoding** (BPE) algorithm, which we are going to cover in detail below.

Turning to the historical development of this approach for a moment, the paper that popularized the use of the byte-level BPE algorithm for language model tokenization is the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) from OpenAI in 2019, "Language Models are Unsupervised Multitask Learners". Scroll down to Section 2.2 on "Input Representation" where they describe and motivate this algorithm. At the end of this section you'll see them say:

> *The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batchsize of 512 is used.*

Recall that in the attention layer of a Transformer, every token is attending to a finite list of tokens previously in the sequence. The paper here says that the GPT-2 model has a context length of 1024 tokens, up from 512 in GPT-1. In other words, tokens are the fundamental "atoms" at the input to the LLM. And tokenization is the process for taking raw strings in Python and converting them to a list of tokens, and vice versa. As another popular example to demonstrate the pervasiveness of this abstraction, if you go to the [Llama 2](https://arxiv.org/abs/2307.09288) paper as well and you search for "token", you're going to get 63 hits. So for example, the paper claims that they trained on 2 trillion tokens, etc.

### Brief taste of the complexities of tokenization

Before we dive into details of the implementation, let's briefly motivate the need to understand the tokenization process in some detail. Tokenization is at the heart of a lot of weirdness in LLMs and I would advise that you do not brush it off. A lot of the issues that may look like issues with the neural network architecture actually trace back to tokenization. Here are just a few examples:

- Why can't LLM spell words? **Tokenization**.
- Why can't LLM do super simple string processing tasks like reversing a string? **Tokenization**.
- Why is LLM worse at non-English languages (e.g. Japanese)? **Tokenization**.
- Why is LLM bad at simple arithmetic? **Tokenization**.
- Why did GPT-2 have more than necessary trouble coding in Python? **Tokenization**.
- Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? **Tokenization**.
- What is this weird warning I get about a "trailing whitespace"? **Tokenization**.
- Why did the LLM break if I ask it about "SolidGoldMagikarp"? **Tokenization**.
- Why should I prefer to use YAML over JSON with LLMs? **Tokenization**.
- Why is LLM not actually end-to-end language modeling? **Tokenization**.
- What is the real root of suffering? **Tokenization**.

We will loop back around to these at the end of the video.

### Visual preview of tokenization

Next, let's load this [tokenization webapp](https://tiktokenizer.vercel.app). What is nice about this webapp is that tokenization is running live in your web browser, allowing you to easily input some text string at the input, and see the tokenization on the right. On the top, you can see that we are currently using the `gpt2` tokenizer, and we see that the string that we pasted in with this example is currently tokenizing into 300 tokens. Here they are shown explicitly in colors:

![tiktokenizer](assets/tiktokenizer.png)

So for example, the string "Tokenization" encodes into the token 30642 followed by the token 1634. The token " is" (note that this is three characters, including the space in the front, this is important!) is index 318. Be careful with whitespace because it is absolutely present in the string and must be tokenized along with all the other characters, but it is usually omitted in visualization for clarity. You can toggle its visualization on and off at the bottom of the app. In the same way, the token " at" is 379, " the" is 262, etc.

Next, we have a simple example of some arithmetic. Here, we see that numbers may be inconsistently decomposed by the tokenizer. For example, the number 127 is a single token of three characters, but the number 677 becomes two tokens: the token " 6" (again, note the space in the front!) and the token "77". We rely on the large language model to make sense of this arbitrariness. It has to learn inside its parameters and during training that these two tokens (" 6" and "77") actually combine to create the number 677. In the same way, we see that if the LLM wanted to predict that the result of this sum is the number 804, it would have to output that in two time steps: first it has to emit the token " 8", and then the token "04". Note that all of these splits look completely arbitrary. In the example right below, we see that 1275 is "12" followed by "75", 6773 is actually two tokens " 6", "773", and 8041 is " 8", "041".
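
If you want to poke at this yourself, here is a small sketch (assuming `tiktoken` is installed); the exact token ids are vocabulary-specific, so it only prints the decoded pieces of each number under the `gpt2` tokenizer:

```python
import tiktoken

# show how the gpt2 tokenizer chunks a few numbers into tokens
enc = tiktoken.get_encoding("gpt2")
for s in ["127", " 677", " 804", "1275", " 6773", " 8041"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(repr(s), "->", pieces)
```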

(to be continued...)
(TODO: may continue this unless we figure out how to generate it automatically from the video :))
minbep.py
ADDED
@@ -0,0 +1,23 @@
from minbpe import RegexTokenizer

# Initialize the tokenizer
tokenizer = RegexTokenizer()

# Read text from a file
file_path = "/Users/mohammad.ibrahim/Desktop/TSAI/combined_text.txt"
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Train the tokenizer
tokenizer.train(text, 256 + 5)  # 256 are the byte tokens, then do 5 merges

# Encode the text
encoded_text = tokenizer.encode(text)
print("Encoded:", encoded_text)

# Decode the text
decoded_text = tokenizer.decode(encoded_text)
print("Decoded:", decoded_text)

# Save the trained tokenizer model
tokenizer.save("first")  # writes two files: first.model (for loading) and first.vocab (for viewing)
minbpe/__init__.py
ADDED
@@ -0,0 +1,4 @@
from .base import Tokenizer
from .basic import BasicTokenizer
from .regex import RegexTokenizer
from .gpt4 import GPT4Tokenizer
minbpe/base.py
ADDED
@@ -0,0 +1,165 @@
"""
Contains the base Tokenizer class and a few common helper functions.
The base class also contains the (common) save/load functionality.
It would be possible to be a lot more strict about the interface and
e.g. isolating all regex/pattern parts to the RegexTokenizer, but
some concessions are made for simplicity.
"""
import unicodedata

# -----------------------------------------------------------------------------
# a few helper functions useful for both BasicTokenizer and RegexTokenizer

def get_stats(ids, counts=None):
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    Optionally allows to update an existing dictionary of counts
    """
    counts = {} if counts is None else counts
    for pair in zip(ids, ids[1:]): # iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    """
    In the list of integers (ids), replace all consecutive occurrences
    of pair with the new integer token idx
    Example: ids=[1, 2, 3, 1, 2], pair=(1, 2), idx=4 -> [4, 3, 4]
    """
    newids = []
    i = 0
    while i < len(ids):
        # if not at the very last position AND the pair matches, replace it
        if ids[i] == pair[0] and i < len(ids) - 1 and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

# first two helper functions...
def replace_control_characters(s: str) -> str:
    # we don't want to print control characters
    # which distort the output (e.g. \n or much worse)
    # https://stackoverflow.com/questions/4324790/removing-control-characters-from-a-string-in-python/19016117#19016117
    # http://www.unicode.org/reports/tr44/#GC_Values_Table
    chars = []
    for ch in s:
        if unicodedata.category(ch)[0] != "C":
            chars.append(ch) # this character is ok
        else:
            chars.append(f"\\u{ord(ch):04x}") # escape
    return "".join(chars)

def render_token(t: bytes) -> str:
    # pretty print a token, escaping control characters
    s = t.decode('utf-8', errors='replace')
    s = replace_control_characters(s)
    return s

# -----------------------------------------------------------------------------
# the base Tokenizer class

class Tokenizer:
    """Base class for Tokenizers"""

    def __init__(self):
        # default: vocab size of 256 (all bytes), no merges, no patterns
        self.merges = {} # (int, int) -> int
        self.pattern = "" # str
        self.special_tokens = {} # str -> int, e.g. {'<|endoftext|>': 100257}
        self.vocab = self._build_vocab() # int -> bytes

    def train(self, text, vocab_size, verbose=False):
        # Tokenizer can train a vocabulary of size vocab_size from text
        raise NotImplementedError

    def encode(self, text):
        # Tokenizer can encode a string into a list of integers
        raise NotImplementedError

    def decode(self, ids):
        # Tokenizer can decode a list of integers into a string
        raise NotImplementedError

    def _build_vocab(self):
        # vocab is simply and deterministically derived from merges
        vocab = {idx: bytes([idx]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        for special, idx in self.special_tokens.items():
            vocab[idx] = special.encode("utf-8")
        return vocab

    def save(self, file_prefix):
        """
        Saves two files: file_prefix.vocab and file_prefix.model
        This is inspired (but not equivalent to!) sentencepiece's model saving:
        - model file is the critical one, intended for load()
        - vocab file is just a pretty printed version for human inspection only
        """
        # write the model: to be used in load() later
        model_file = file_prefix + ".model"
        with open(model_file, 'w') as f:
            # write the version, pattern and merges, that's all that's needed
            f.write("minbpe v1\n")
            f.write(f"{self.pattern}\n")
            # write the special tokens, first the number of them, then each one
            f.write(f"{len(self.special_tokens)}\n")
            for special, idx in self.special_tokens.items():
                f.write(f"{special} {idx}\n")
            # the merges dict
            for idx1, idx2 in self.merges:
                f.write(f"{idx1} {idx2}\n")
        # write the vocab: for the human to look at
        vocab_file = file_prefix + ".vocab"
        inverted_merges = {idx: pair for pair, idx in self.merges.items()}
        with open(vocab_file, "w", encoding="utf-8") as f:
            for idx, token in self.vocab.items():
                # note: many tokens may be partial utf-8 sequences
                # and cannot be decoded into valid strings. Here we're using
                # errors='replace' to replace them with the replacement char �.
                # this also means that we couldn't possibly use .vocab in load()
                # because decoding in this way is a lossy operation!
                s = render_token(token)
                # find the children of this token, if any
                if idx in inverted_merges:
                    # if this token has children, render it nicely as a merge
                    idx0, idx1 = inverted_merges[idx]
                    s0 = render_token(self.vocab[idx0])
                    s1 = render_token(self.vocab[idx1])
                    f.write(f"[{s0}][{s1}] -> [{s}] {idx}\n")
                else:
                    # otherwise this is leaf token, just print it
                    # (this should just be the first 256 tokens, the bytes)
                    f.write(f"[{s}] {idx}\n")

    def load(self, model_file):
        """Inverse of save() but only for the model file"""
        assert model_file.endswith(".model")
        # read the model file
        merges = {}
        special_tokens = {}
        idx = 256
        with open(model_file, 'r', encoding="utf-8") as f:
            # read the version
            version = f.readline().strip()
            assert version == "minbpe v1"
            # read the pattern
            self.pattern = f.readline().strip()
            # read the special tokens
            num_special = int(f.readline().strip())
            for _ in range(num_special):
                special, special_idx = f.readline().strip().split()
                special_tokens[special] = int(special_idx)
            # read the merges
            for line in f:
                idx1, idx2 = map(int, line.split())
                merges[(idx1, idx2)] = idx
                idx += 1
        self.merges = merges
        self.special_tokens = special_tokens
        self.vocab = self._build_vocab()
minbpe/basic.py
ADDED
@@ -0,0 +1,74 @@
"""
Minimal (byte-level) Byte Pair Encoding tokenizer.

Algorithmically follows along the GPT tokenizer:
https://github.com/openai/gpt-2/blob/master/src/encoder.py

But:
- Does not handle the regular expression splitting pattern.
- Does not handle any special tokens.
"""

from .base import Tokenizer, get_stats, merge


class BasicTokenizer(Tokenizer):

    def __init__(self):
        super().__init__()

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        num_merges = vocab_size - 256

        # input text preprocessing
        text_bytes = text.encode("utf-8") # raw bytes
        ids = list(text_bytes) # list of integers in range 0..255

        # iteratively merge the most common pairs to create new tokens
        merges = {} # (int, int) -> int
        vocab = {idx: bytes([idx]) for idx in range(256)} # int -> bytes
        for i in range(num_merges):
            # count up the number of times every consecutive pair appears
            stats = get_stats(ids)
            # find the pair with the highest count
            pair = max(stats, key=stats.get)
            # mint a new token: assign it the next available id
            idx = 256 + i
            # replace all occurrences of pair in ids with idx
            ids = merge(ids, pair, idx)
            # save the merge
            merges[pair] = idx
            vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
            # prints
            if verbose:
                print(f"merge {i+1}/{num_merges}: {pair} -> {idx} ({vocab[idx]}) had {stats[pair]} occurrences")

        # save class variables
        self.merges = merges # used in encode()
        self.vocab = vocab # used in decode()

    def decode(self, ids):
        # given ids (list of integers), return Python string
        text_bytes = b"".join(self.vocab[idx] for idx in ids)
        text = text_bytes.decode("utf-8", errors="replace")
        return text

    def encode(self, text):
        # given a string text, return the token ids
        text_bytes = text.encode("utf-8") # raw bytes
        ids = list(text_bytes) # list of integers in range 0..255
        while len(ids) >= 2:
            # find the pair with the lowest merge index
            stats = get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            # subtle: if there are no more merges available, the key will
            # result in an inf for every single pair, and the min will be
            # just the first pair in the list, arbitrarily
            # we can detect this terminating case by a membership check
            if pair not in self.merges:
                break # nothing else can be merged anymore
            # otherwise let's merge the best pair (lowest merge index)
            idx = self.merges[pair]
            ids = merge(ids, pair, idx)
        return ids
minbpe/gpt4.py
ADDED
@@ -0,0 +1,130 @@
"""
Implements the GPT-4 Tokenizer as a light wrapper around the RegexTokenizer.
Note that this is a pretrained tokenizer. By default and inside init(), it
loads the pretrained tokenizer from the `cl100k_base` tokenizer of tiktoken.
"""

import tiktoken
from .regex import RegexTokenizer


def bpe(mergeable_ranks, token, max_rank):
    # helper function used in get_gpt4_merges() to reconstruct the merge forest
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        assert min_idx is not None
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts


def recover_merges(mergeable_ranks):
    # the `merges` are already the byte sequences in their merged state.
    # so we have to recover the original pairings. We can do this by doing
    # a small BPE training run on all the tokens, in their order.
    # also see https://github.com/openai/tiktoken/issues/60
    # also see https://github.com/karpathy/minbpe/issues/11#issuecomment-1950805306
    merges = {}
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue # skip raw bytes
        pair = tuple(bpe(mergeable_ranks, token, max_rank=rank))
        assert len(pair) == 2
        # recover the integer ranks of the pair
        ix0 = mergeable_ranks[pair[0]]
        ix1 = mergeable_ranks[pair[1]]
        merges[(ix0, ix1)] = rank

    return merges

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
GPT4_SPECIAL_TOKENS = {
    '<|endoftext|>': 100257,
    '<|fim_prefix|>': 100258,
    '<|fim_middle|>': 100259,
    '<|fim_suffix|>': 100260,
    '<|endofprompt|>': 100276
}

class GPT4Tokenizer(RegexTokenizer):
    """Lightweight wrapper on RegexTokenizer that matches GPT-4's tokenizer."""

    def __init__(self):
        super().__init__(pattern=GPT4_SPLIT_PATTERN)
        # get the official tokenizer and its merges
        enc = tiktoken.get_encoding("cl100k_base")
        mergeable_ranks = enc._mergeable_ranks
        # the merges are those of gpt4, but we have to recover them
        self.merges = recover_merges(mergeable_ranks)
        # reconstruct the vocab from the merges
        vocab = {idx: bytes([idx]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        self.vocab = vocab
        # now here is another tricky part.
        # for some reason, the tokens corresponding to individual bytes
        # are permuted in a different order. This is completely non-sensical
        # and probably historical, but therefore we have to deal with it here.
        self.byte_shuffle = {i: mergeable_ranks[bytes([i])] for i in range(256)}
        self.inverse_byte_shuffle = {v: k for k, v in self.byte_shuffle.items()}
        # finally register the special tokens
        self.register_special_tokens(GPT4_SPECIAL_TOKENS)

    def _encode_chunk(self, text_bytes):
        # before we start processing bytes, we have to permute them
        text_bytes = bytes(self.byte_shuffle[b] for b in text_bytes)
        ids = super()._encode_chunk(text_bytes)
        return ids

    def decode(self, ids):
        # we have to un-permute the bytes before we decode
        text_bytes = b"".join(self.vocab[idx] for idx in ids)
        text_bytes = bytes(self.inverse_byte_shuffle[b] for b in text_bytes)
        text = text_bytes.decode("utf-8", errors="replace")
        return text

    # this is a pretrained tokenizer, it is not intended to be trained
    def train(self, text, vocab_size, verbose=False):
        raise NotImplementedError

    # save/load would require some thought.
    # we'd have to change save/load of base to add support for byte_shuffle...
    # alternatively, we could move byte_shuffle to base class, but that would
    # mean that we're making ugly our beautiful Tokenizer just to support
    # the GPT-4 tokenizer and its weird historical quirks around byte_shuffle.
    def save(self, file_prefix):
        raise NotImplementedError("GPT4Tokenizer cannot be saved.")

    def load(self, model_file):
        raise NotImplementedError("GPT4Tokenizer cannot be loaded.")

    def save_vocab(self, vocab_file):
        # just for visualization purposes let's output the GPT-4 tokens
        # in the exact same format as the base class would.
        # simple run as:
        # python -c "from minbpe import GPT4Tokenizer; GPT4Tokenizer().save_vocab('gpt4.vocab')"
        from .base import render_token
        # build vocab being mindful of the byte shuffle
        vocab = {idx: bytes([self.inverse_byte_shuffle[idx]]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        # now merge the shuffled bytes and write to file
        inverted_merges = {idx: pair for pair, idx in self.merges.items()}
        with open(vocab_file, "w", encoding="utf-8") as f:
            for idx, token in vocab.items():
                s = render_token(token)
                if idx in inverted_merges:
                    idx0, idx1 = inverted_merges[idx]
                    s0 = render_token(vocab[idx0])
                    s1 = render_token(vocab[idx1])
                    f.write(f"[{s0}][{s1}] -> [{s}] {idx}\n")
                else:
                    f.write(f"[{s}] {idx}\n")
minbpe/regex.py
ADDED
@@ -0,0 +1,164 @@
"""
Minimal (byte-level) Byte Pair Encoding tokenizer.

Algorithmically follows along the GPT tokenizer:
https://github.com/openai/gpt-2/blob/master/src/encoder.py

Unlike BasicTokenizer:
- RegexTokenizer handles an optional regex splitting pattern.
- RegexTokenizer handles optional special tokens.
"""

import regex as re
from .base import Tokenizer, get_stats, merge


# the main GPT text split patterns, see
# https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py
GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""


class RegexTokenizer(Tokenizer):

    def __init__(self, pattern=None):
        """
        - pattern: optional string to override the default (GPT-4 split pattern)
        - special_tokens: str -> int dictionary of special tokens
          example: {'<|endoftext|>': 100257}
        """
        super().__init__()
        self.pattern = GPT4_SPLIT_PATTERN if pattern is None else pattern
        self.compiled_pattern = re.compile(self.pattern)
        self.special_tokens = {}
        self.inverse_special_tokens = {}

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        num_merges = vocab_size - 256

        # split the text up into text chunks
        text_chunks = re.findall(self.compiled_pattern, text)

        # input text preprocessing
        ids = [list(ch.encode("utf-8")) for ch in text_chunks]

        # iteratively merge the most common pairs to create new tokens
        merges = {} # (int, int) -> int
        vocab = {idx: bytes([idx]) for idx in range(256)} # idx -> bytes
        for i in range(num_merges):
            # count the number of times every consecutive pair appears
            stats = {}
            for chunk_ids in ids:
                # passing in stats will update it in place, adding up counts
                get_stats(chunk_ids, stats)
            # find the pair with the highest count
            pair = max(stats, key=stats.get)
            # mint a new token: assign it the next available id
            idx = 256 + i
            # replace all occurrences of pair in ids with idx
            ids = [merge(chunk_ids, pair, idx) for chunk_ids in ids]
            # save the merge
            merges[pair] = idx
            vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
            # prints
            if verbose:
                print(f"merge {i+1}/{num_merges}: {pair} -> {idx} ({vocab[idx]}) had {stats[pair]} occurrences")

        # save class variables
        self.merges = merges # used in encode()
        self.vocab = vocab # used in decode()

    def register_special_tokens(self, special_tokens):
        # special_tokens is a dictionary of str -> int
        # example: {"<|endoftext|>": 100257}
        self.special_tokens = special_tokens
        self.inverse_special_tokens = {v: k for k, v in special_tokens.items()}

    def decode(self, ids):
        # given ids (list of integers), return Python string
        part_bytes = []
        for idx in ids:
            if idx in self.vocab:
                part_bytes.append(self.vocab[idx])
            elif idx in self.inverse_special_tokens:
                part_bytes.append(self.inverse_special_tokens[idx].encode("utf-8"))
            else:
                raise ValueError(f"invalid token id: {idx}")
        text_bytes = b"".join(part_bytes)
        text = text_bytes.decode("utf-8", errors="replace")
        return text

    def _encode_chunk(self, text_bytes):
        # return the token ids
        # let's begin. first, convert all bytes to integers in range 0..255
        ids = list(text_bytes)
        while len(ids) >= 2:
            # find the pair with the lowest merge index
            stats = get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            # subtle: if there are no more merges available, the key will
            # result in an inf for every single pair, and the min will be
            # just the first pair in the list, arbitrarily
            # we can detect this terminating case by a membership check
            if pair not in self.merges:
                break # nothing else can be merged anymore
            # otherwise let's merge the best pair (lowest merge index)
            idx = self.merges[pair]
            ids = merge(ids, pair, idx)
        return ids

    def encode_ordinary(self, text):
        """Encoding that ignores any special tokens."""
        # split text into chunks of text by categories defined in regex pattern
        text_chunks = re.findall(self.compiled_pattern, text)
        # all chunks of text are encoded separately, then results are joined
        ids = []
        for chunk in text_chunks:
            chunk_bytes = chunk.encode("utf-8") # raw bytes
            chunk_ids = self._encode_chunk(chunk_bytes)
            ids.extend(chunk_ids)
        return ids

    def encode(self, text, allowed_special="none_raise"):
        """
        Unlike encode_ordinary, this function handles special tokens.
        allowed_special: can be "all"|"none"|"none_raise" or a custom set of special tokens
        if none_raise, then an error is raised if any special token is encountered in text
        this is the default tiktoken behavior right now as well
        any other behavior is either annoying, or a major footgun
        """
        # decode the user desire w.r.t. handling of special tokens
        special = None
        if allowed_special == "all":
            special = self.special_tokens
        elif allowed_special == "none":
            special = {}
        elif allowed_special == "none_raise":
            special = {}
            assert all(token not in text for token in self.special_tokens)
        elif isinstance(allowed_special, set):
            special = {k: v for k, v in self.special_tokens.items() if k in allowed_special}
        else:
            raise ValueError(f"allowed_special={allowed_special} not understood")
        if not special:
            # shortcut: if no special tokens, just use the ordinary encoding
            return self.encode_ordinary(text)
        # otherwise, we have to be careful with potential special tokens in text
        # we handle special tokens by splitting the text
        # based on the occurrence of any exact match with any of the special tokens
        # we can use re.split for this. note that surrounding the pattern with ()
        # makes it into a capturing group, so the special tokens will be included
        special_pattern = "(" + "|".join(re.escape(k) for k in special) + ")"
        special_chunks = re.split(special_pattern, text)
        # now all the special characters are separated from the rest of the text
        # all chunks of text are encoded separately, then results are joined
        ids = []
        for part in special_chunks:
            if part in special:
                # this is a special token, encode it separately as a special case
                ids.append(special[part])
            else:
                # this is an ordinary sequence, encode it normally
                ids.extend(self.encode_ordinary(part))
        return ids
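For context, a minimal RegexTokenizer usage sketch; it mirrors what tests/test_tokenizer.py below exercises. The training text and the special-token id 512 are illustrative choices, not fixed by the library:

from minbpe import RegexTokenizer

sample_text = open("tests/taylorswift.txt", "r", encoding="utf-8").read()
tokenizer = RegexTokenizer()                  # defaults to the GPT-4 split pattern
tokenizer.train(sample_text, vocab_size=512)  # 256 raw bytes + 256 learned merges
tokenizer.register_special_tokens({"<|endoftext|>": 512})  # id chosen just outside the vocab
ids = tokenizer.encode("<|endoftext|>hello world", allowed_special="all")
assert tokenizer.decode(ids) == "<|endoftext|>hello world"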
requirements.txt
ADDED
@@ -0,0 +1,2 @@
regex
tiktoken
tests/__init__.py
ADDED
File without changes
tests/taylorswift.txt
ADDED
The diff for this file is too large to render.
See raw diff
tests/test_tokenizer.py
ADDED
@@ -0,0 +1,135 @@
import pytest
import tiktoken
import os

from minbpe import BasicTokenizer, RegexTokenizer, GPT4Tokenizer

# -----------------------------------------------------------------------------
# common test data

# a few strings to test the tokenizers on
test_strings = [
    "", # empty string
    "?", # single character
    "hello world!!!? (안녕하세요!) lol123 😉", # fun small string
    "FILE:taylorswift.txt", # FILE: is handled as a special string in unpack()
]
def unpack(text):
    # we do this because `pytest -v .` prints the arguments to console, and we don't
    # want to print the entire contents of the file, it creates a mess. So here we go.
    if text.startswith("FILE:"):
        dirname = os.path.dirname(os.path.abspath(__file__))
        taylorswift_file = os.path.join(dirname, text[5:])
        contents = open(taylorswift_file, "r", encoding="utf-8").read()
        return contents
    else:
        return text

specials_string = """
<|endoftext|>Hello world this is one document
<|endoftext|>And this is another document
<|endoftext|><|fim_prefix|>And this one has<|fim_suffix|> tokens.<|fim_middle|> FIM
<|endoftext|>Last document!!! 👋<|endofprompt|>
""".strip()
special_tokens = {
    '<|endoftext|>': 100257,
    '<|fim_prefix|>': 100258,
    '<|fim_middle|>': 100259,
    '<|fim_suffix|>': 100260,
    '<|endofprompt|>': 100276
}
llama_text = """
<|endoftext|>The llama (/ˈlɑːmə/; Spanish pronunciation: [ˈʎama] or [ˈʝama]) (Lama glama) is a domesticated South American camelid, widely used as a meat and pack animal by Andean cultures since the pre-Columbian era.
Llamas are social animals and live with others as a herd. Their wool is soft and contains only a small amount of lanolin.[2] Llamas can learn simple tasks after a few repetitions. When using a pack, they can carry about 25 to 30% of their body weight for 8 to 13 km (5–8 miles).[3] The name llama (in the past also spelled "lama" or "glama") was adopted by European settlers from native Peruvians.[4]
The ancestors of llamas are thought to have originated from the Great Plains of North America about 40 million years ago, and subsequently migrated to South America about three million years ago during the Great American Interchange. By the end of the last ice age (10,000–12,000 years ago), camelids were extinct in North America.[3] As of 2007, there were over seven million llamas and alpacas in South America and over 158,000 llamas and 100,000 alpacas, descended from progenitors imported late in the 20th century, in the United States and Canada.[5]
<|fim_prefix|>In Aymara mythology, llamas are important beings. The Heavenly Llama is said to drink water from the ocean and urinates as it rains.[6] According to Aymara eschatology,<|fim_suffix|> where they come from at the end of time.[6]<|fim_middle|> llamas will return to the water springs and ponds<|endofprompt|>
""".strip()

# -----------------------------------------------------------------------------
# tests

# test encode/decode identity for a few different strings
@pytest.mark.parametrize("tokenizer_factory", [BasicTokenizer, RegexTokenizer, GPT4Tokenizer])
@pytest.mark.parametrize("text", test_strings)
def test_encode_decode_identity(tokenizer_factory, text):
    text = unpack(text)
    tokenizer = tokenizer_factory()
    ids = tokenizer.encode(text)
    decoded = tokenizer.decode(ids)
    assert text == decoded

# test that our tokenizer matches the official GPT-4 tokenizer
@pytest.mark.parametrize("text", test_strings)
def test_gpt4_tiktoken_equality(text):
    text = unpack(text)
    tokenizer = GPT4Tokenizer()
    enc = tiktoken.get_encoding("cl100k_base")
    tiktoken_ids = enc.encode(text)
    gpt4_tokenizer_ids = tokenizer.encode(text)
    assert gpt4_tokenizer_ids == tiktoken_ids

# test the handling of special tokens
def test_gpt4_tiktoken_equality_special_tokens():
    tokenizer = GPT4Tokenizer()
    enc = tiktoken.get_encoding("cl100k_base")
    tiktoken_ids = enc.encode(specials_string, allowed_special="all")
    gpt4_tokenizer_ids = tokenizer.encode(specials_string, allowed_special="all")
    assert gpt4_tokenizer_ids == tiktoken_ids

# reference test to add more tests in the future
@pytest.mark.parametrize("tokenizer_factory", [BasicTokenizer, RegexTokenizer])
def test_wikipedia_example(tokenizer_factory):
    """
    Quick unit test, following along the Wikipedia example:
    https://en.wikipedia.org/wiki/Byte_pair_encoding

    According to Wikipedia, running bpe on the input string:
    "aaabdaaabac"

    for 3 merges will result in string:
    "XdXac"

    where:
    X=ZY
    Y=ab
    Z=aa

    Keep in mind that for us a=97, b=98, c=99, d=100 (ASCII values)
    so Z will be 256, Y will be 257, X will be 258.

    So we expect the output list of ids to be [258, 100, 258, 97, 99]
    """
    tokenizer = tokenizer_factory()
    text = "aaabdaaabac"
    tokenizer.train(text, 256 + 3)
    ids = tokenizer.encode(text)
    assert ids == [258, 100, 258, 97, 99]
    assert tokenizer.decode(tokenizer.encode(text)) == text

@pytest.mark.parametrize("special_tokens", [{}, special_tokens])
def test_save_load(special_tokens):
    # take a bit more complex piece of text and train the tokenizer, chosen at random
    text = llama_text
    # create a Tokenizer and do 64 merges
    tokenizer = RegexTokenizer()
    tokenizer.train(text, 256 + 64)
    tokenizer.register_special_tokens(special_tokens)
    # verify that decode(encode(x)) == x
    assert tokenizer.decode(tokenizer.encode(text, "all")) == text
    # verify that save/load work as expected
    ids = tokenizer.encode(text, "all")
    # save the tokenizer (TODO use a proper temporary directory)
    tokenizer.save("test_tokenizer_tmp")
    # re-load the tokenizer
    tokenizer = RegexTokenizer()
    tokenizer.load("test_tokenizer_tmp.model")
    # verify that decode(encode(x)) == x
    assert tokenizer.decode(ids) == text
    assert tokenizer.decode(tokenizer.encode(text, "all")) == text
    assert tokenizer.encode(text, "all") == ids
    # delete the temporary files
    for file in ["test_tokenizer_tmp.model", "test_tokenizer_tmp.vocab"]:
        os.remove(file)

if __name__ == "__main__":
    pytest.main()
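These tests are normally run from the repository root with `pytest -v .`; besides pytest itself they need tiktoken installed (see requirements.txt), and the FILE: entry pulls in tests/taylorswift.txt as the longer round-trip case.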
tokenize.ipynb
ADDED
@@ -0,0 +1,128 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_path = \"/Users/mohammad.ibrahim/Desktop/TSAI/combined_text.txt\"\n",
    "with open(file_path, 'r', encoding='utf-8') as file:\n",
    "    text = file.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = r\"\"\"'(?i:[sdmt]|ll|ve|re)|[^\\r\\n\\p{L}\\p{N}।•]?+\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}।•]++[\\r\\n]*|\\s*[\\r\\n]|\\s+(?!\\S)|\\s+|।|•\"\"\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import regex as re\n",
    "text_chunks = re.findall(pattern, text)\n",
    "\n",
    " # input text preprocessing\n",
    "tokens = [list(ch.encode(\"utf-8\")) for ch in text_chunks]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# tokens = text.encode(\"utf-8\") # raw bytes\n",
    "tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tokens length: 179910393\n",
      "ids length: 32798069\n",
      "compression ratio: 5.49X\n"
     ]
    }
   ],
   "source": [
    "def get_stats(ids):\n",
    "    counts = {}\n",
    "    for pair in zip(ids, ids[1:]):\n",
    "        counts[pair] = counts.get(pair, 0) + 1\n",
    "    return counts\n",
    "\n",
    "def merge(ids, pair, idx):\n",
    "    newids = []\n",
    "    i = 0\n",
    "    while i < len(ids):\n",
    "        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:\n",
    "            newids.append(idx)\n",
    "            i += 2\n",
    "        else:\n",
    "            newids.append(ids[i])\n",
    "            i += 1\n",
    "    return newids\n",
    "\n",
    "# ---\n",
    "vocab_size = 1000 # the desired final vocabulary size\n",
    "num_merges = vocab_size - 256\n",
    "ids = list(tokens) # copy so we don't destroy the original list\n",
    "\n",
    "merges = {} # (int, int) -> int\n",
    "for i in range(num_merges):\n",
    "    stats = get_stats(ids)\n",
    "    pair = max(stats, key=stats.get)\n",
    "    idx = 256 + i\n",
    "    # print(f\"merging {pair} into a new token {idx}\")\n",
    "    ids = merge(ids, pair, idx)\n",
    "    merges[pair] = idx\n",
    "\n",
    "print(\"tokens length:\", len(tokens))\n",
    "print(\"ids length:\", len(ids))\n",
    "print(f\"compression ratio: {len(tokens) / len(ids):.2f}X\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
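The notebook ends after training, so the following is a hedged sketch (not part of the notebook) of how the learned merges dict could be applied to new text. It reuses the notebook's own get_stats/merge helpers and its pattern variable, and mirrors the greedy loop in RegexTokenizer._encode_chunk above:

import regex as re

def encode(new_text, merges, pattern):
    # greedily apply the learned merges to each regex-split chunk of new_text
    ids = []
    for chunk in re.findall(pattern, new_text):
        chunk_ids = list(chunk.encode("utf-8"))
        while len(chunk_ids) >= 2:
            stats = get_stats(chunk_ids)  # helper defined in the training cell
            # pick the pair that was merged earliest during training
            pair = min(stats, key=lambda p: merges.get(p, float("inf")))
            if pair not in merges:
                break  # no learned merge applies anymore
            chunk_ids = merge(chunk_ids, pair, merges[pair])  # helper defined in the training cell
        ids.extend(chunk_ids)
    return ids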
train.py
ADDED
@@ -0,0 +1,27 @@
"""
Train our Tokenizers on some data, just to see them in action.
The whole thing runs in ~25 seconds on my laptop.
"""

import os
import time
from minbpe import BasicTokenizer, RegexTokenizer

# open some text and train a vocab of 512 tokens
text = open("tests/taylorswift.txt", "r", encoding="utf-8").read()

# create a directory for models, so we don't pollute the current directory
os.makedirs("models", exist_ok=True)

t0 = time.time()
for TokenizerClass, name in zip([BasicTokenizer, RegexTokenizer], ["basic", "regex"]):

    # construct the Tokenizer object and kick off verbose training
    tokenizer = TokenizerClass()
    tokenizer.train(text, 512, verbose=True)
    # writes two files in the models directory: name.model, and name.vocab
    prefix = os.path.join("models", name)
    tokenizer.save(prefix)
t1 = time.time()

print(f"Training took {t1 - t0:.2f} seconds")
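Once train.py has run, the saved models can be reloaded without retraining; a minimal sketch, assuming the models/ directory produced above:

from minbpe import RegexTokenizer

tokenizer = RegexTokenizer()
tokenizer.load("models/regex.model")  # written by train.py via tokenizer.save("models/regex")
ids = tokenizer.encode("hello world")
print(ids, tokenizer.decode(ids))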