isaacus/kanon-tokenizer · Hugging Face

The Kanon tokenizer is the world's most space-efficient legal document tokenizer of its size.

With a vocabulary of only 65,536 tokens, documents encoded with the tokenizer can be stored as unsigned 16-bit integers, substantially reducing memory requirements compared with larger vocabularies.
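
As a rough illustration of the storage claim, the sketch below packs a list of token IDs into unsigned 16-bit integers using only the Python standard library. It assumes token IDs run from 0 to 65,535, which is what a 65,536-token vocabulary implies. The helper names `pack_ids` and `unpack_ids` and the sample IDs are hypothetical and are not part of the tokenizer itself.

```python
import struct

# Vocabulary size stated above; IDs are assumed to span 0..65_535.
VOCAB_SIZE = 65_536


def pack_ids(token_ids):
    """Pack token IDs as little-endian unsigned 16-bit integers (2 bytes each)."""
    if any(not 0 <= t < VOCAB_SIZE for t in token_ids):
        raise ValueError("token ID outside the unsigned 16-bit range")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def unpack_ids(buf):
    """Recover the list of token IDs from a buffer produced by pack_ids."""
    return list(struct.unpack(f"<{len(buf) // 2}H", buf))


# Made-up IDs for illustration only; real IDs would come from the tokenizer.
ids = [0, 17, 4096, 65_535]
packed = pack_ids(ids)

# Two bytes per token, versus four for a 32-bit integer representation.
bytes_as_uint16 = len(packed)
bytes_as_uint32 = 4 * len(ids)
```

Any vocabulary larger than 65,536 tokens would need a wider integer type, typically 32 bits, which doubles the storage per token.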

The Kanon tokenizer is already used in production by all of Isaacus' currently available Kanon models.

The Kanon tokenizer was trained on Isaacus' Blackstone Corpus, one of the world’s largest private repositories of contracts, decisions, legislation and other legal and government documents, covering a wide range of jurisdictions, including the U.S., U.K., Canada, Australia, New Zealand, Ireland, the entire European Union, the United Nations and the International Court of Justice, to name a few.

The Kanon tokenizer is freely licensed, including for commercial use, under the Apache 2.0 license. We actively encourage legal AI practitioners, including our own competitors, to use the Kanon tokenizer when training their legal AI models, both to promote better interoperability between models and to improve their space efficiency.