cl100k_base as a transformers GPT2 tokenizer

The cl100k_base vocab was converted from tiktoken to the Hugging Face format via this code by Xenova.

from transformers import GPT2TokenizerFast, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/cl100k_base")
# if AutoTokenizer gives issues, load with GPT2TokenizerFast.from_pretrained instead

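A quick round-trip check (a minimal sketch; the sample string is arbitrary, and only the eos token id 100257 is taken from the tokenizer details below):

text = "tokenize me with cl100k_base"
ids = tokenizer(text).input_ids
print(ids)                     # BPE token ids from the converted cl100k_base vocab
print(tokenizer.decode(ids))   # should round-trip back to the original text
print(tokenizer.eos_token_id)  # 100257 -> "<|endoftext|>", per the details below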
Tokenizer details

GPT2TokenizerFast(
    name_or_path="BEE-spoke-data/cl100k_base",
    vocab_size=100261,
    model_max_length=8192,
    is_fast=True,
    padding_side="right",
    truncation_side="right",
    special_tokens={
        "bos_token": "<|endoftext|>",
        "eos_token": "<|endoftext|>",
        "unk_token": "<|endoftext|>",
    },
    clean_up_tokenization_spaces=True,
    added_tokens_decoder={
        "100257": AddedToken(
            "<|endoftext|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100258": AddedToken(
            "<|fim_prefix|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100259": AddedToken(
            "<|fim_middle|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100260": AddedToken(
            "<|fim_suffix|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
        "100276": AddedToken(
            "<|endofprompt|>",
            rstrip=False,
            lstrip=False,
            single_word=False,
            normalized=False,
            special=True,
        ),
    },
)