I have found a discrepancy in the token count when generating responses with the Zephyr model. When I pass a question with a specific max_new_tokens value and then count the tokens in the returned text with the Hugging Face tokenizer (len(pipe.tokenizer.encode(response))), the result exceeds the specified max_new_tokens, sometimes by more than 80 tokens. The behavior is consistent across both Llama 2 and Zephyr models.

Code:
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.bfloat16, device_map="auto")

# Format the messages with the tokenizer's chat template

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=100, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(len(pipe.tokenizer.encode(outputs[0]["generated_text"])))  # I get 116
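
A possible explanation worth checking, not verified against this exact run: by default the text-generation pipeline returns the prompt together with the completion (return_full_text=True), and tokenizer.encode adds special tokens such as BOS unless add_special_tokens=False is passed. Both would inflate the count beyond max_new_tokens. The sketch below counts only the completion. The helper count_new_tokens and the WhitespaceTokenizer stand-in are hypothetical names used for illustration so the example runs without downloading a model; with the real pipeline you would pass pipe.tokenizer instead.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (demo only)."""

    def encode(self, text, add_special_tokens=True):
        # One fake id per whitespace-separated word.
        ids = list(range(2, 2 + len(text.split())))
        # Mimic a tokenizer that prepends a BOS token by default.
        return [1] + ids if add_special_tokens else ids


def count_new_tokens(tokenizer, prompt, generated_text):
    """Count tokens in the completion only, without prompt or special tokens."""
    # With return_full_text=True the output starts with the prompt, so strip it.
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    # add_special_tokens=False keeps BOS/EOS markers out of the count.
    return len(tokenizer.encode(completion, add_special_tokens=False))


tokenizer = WhitespaceTokenizer()
prompt = "system: be a pirate user: hi assistant:"
full_text = prompt + " Arr matey ahoy"

naive_count = len(tokenizer.encode(full_text))  # prompt + completion + BOS
new_count = count_new_tokens(tokenizer, prompt, full_text)  # completion only
print(naive_count, new_count)  # prints "11 3"
```

With the real pipeline, passing return_full_text=False to pipe(...) should make generated_text contain only the completion, so the prefix stripping becomes unnecessary. Re-encoding decoded text is not guaranteed to reproduce the exact generated token ids, so a difference of a few tokens may remain even then.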

Incorrect Token Count in Generated Response

We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating