Crash while loading tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('THUDM/LongCite-llama3.1-8b', trust_remote_code=True)
results in
FileNotFoundError Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = AutoTokenizer.from_pretrained('THUDM/LongCite-llama3.1-8b', trust_remote_code=True)
File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:847, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
845 if os.path.isdir(pretrained_model_name_or_path):
846 tokenizer_class.register_for_auto_class()
--> 847 return tokenizer_class.from_pretrained(
848 pretrained_model_name_or_path, *inputs, trust_remote_code=trust_remote_code, **kwargs
849 )
850 elif config_tokenizer_class is not None:
851 tokenizer_class = None
File ~/.cache/huggingface/modules/transformers_modules/THUDM/LongCite-llama3.1-8b/8265f5e5bceab232605db43e6e0c6579ff941354/tiktoken_tokenizer.py:58, in TikTokenizer.from_pretrained(path, *inputs, **kwargs)
56 @staticmethod
57 def from_pretrained(path, *inputs, **kwargs):
---> 58 return TikTokenizer(vocab_file=os.path.join(path, "tokenizer.tiktoken"))
File ~/.cache/huggingface/modules/transformers_modules/THUDM/LongCite-llama3.1-8b/8265f5e5bceab232605db43e6e0c6579ff941354/tiktoken_tokenizer.py:67, in TikTokenizer.__init__(self, vocab_file)
65 if vocab_file is not None:
66 mergeable_ranks = {}
---> 67 with open(vocab_file) as f:
68 for line in f:
69 token, rank = line.strip().split()
FileNotFoundError: [Errno 2] No such file or directory: 'THUDM/LongCite-llama3.1-8b/tokenizer.tiktoken'
Yes, I'm seeing the same issue.
A workaround is to download the model locally (with huggingface-cli download) and load it from the local path instead of the model ID.
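The traceback above suggests why a local path helps: the repository's custom TikTokenizer.from_pretrained joins its path argument with "tokenizer.tiktoken" and opens the result directly, so a model ID becomes a relative path that does not exist on disk. A minimal sketch of that behavior using only the standard library (vocab_path is a hypothetical helper name, and /path/to/snapshot is a placeholder, not a real location):

```python
import os

def vocab_path(path):
    # Mirrors the os.path.join call shown at line 58 of
    # tiktoken_tokenizer.py in the traceback; nothing is downloaded.
    return os.path.join(path, "tokenizer.tiktoken")

# A model ID is treated as a relative directory name.
print(vocab_path("THUDM/LongCite-llama3.1-8b"))

# A local snapshot directory (placeholder) yields a path that open() can find,
# provided the file was actually downloaded there.
print(vocab_path("/path/to/snapshot"))
```

On a POSIX system the first call yields the same relative path that appears in the FileNotFoundError message.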
OK, thanks for the workaround!
Awesome, thank you for the workaround.
Here is a bit more detail for those who, like me, are using paths instead of IDs for the first time :)
huggingface-cli download THUDM/LongCite-llama3.1-8b
Then adjust the local path. Important: point to the snapshot directory. The repository folder alone (/home/someuser/.cache/huggingface/hub/models--THUDM--LongCite-llama3.1-8b/) won't work.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('/home/someuser/.cache/huggingface/hub/models--THUDM--LongCite-llama3.1-8b/snapshots/58260b89bc2a547b814f44b89914b1e282b2d5cd/', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
'/home/someuser/.cache/huggingface/hub/models--THUDM--LongCite-llama3.1-8b/snapshots/58260b89bc2a547b814f44b89914b1e282b2d5cd/',
torch_dtype=torch.bfloat16,
trust_remote_code=True,
device_map='auto'
)
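To avoid hard-coding the snapshot hash, the snapshot directory can be looked up from the cache layout shown above (models--ORG--NAME/snapshots/HASH/). This is only a sketch under that assumption: find_snapshot_dir is a hypothetical helper, not part of transformers or huggingface_hub, and it simply picks the most recently modified snapshot.

```python
from pathlib import Path

def find_snapshot_dir(cache_dir, repo_id):
    """Return the most recently modified snapshot directory for repo_id.

    Assumes the hub cache layout seen in this thread:
    <cache_dir>/models--<org>--<name>/snapshots/<hash>/
    """
    repo_folder = Path(cache_dir) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_folder / "snapshots"
    # Collect snapshot subdirectories, if the snapshots folder exists at all.
    candidates = [p for p in snapshots.iterdir() if p.is_dir()] if snapshots.is_dir() else []
    if not candidates:
        raise FileNotFoundError(f"no snapshots found under {snapshots}")
    # If several revisions are cached, prefer the newest one.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, find_snapshot_dir(Path.home() / ".cache/huggingface/hub", "THUDM/LongCite-llama3.1-8b") would return a path that can be passed as a string to from_pretrained in place of the hard-coded one. Alternatively, huggingface_hub.snapshot_download returns the local snapshot folder directly.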
To the developers: Thank you for this amazing model. I had high expectations, and they have been surpassed.
Thanks for pointing out this bug. We have fixed it now.