Update generation_config.json
I noticed, when using the instruct model with chat templating, that the chat template uses <|eot_id|> rather than the EOS token <|end_of_text|>. So when the assistant responds to messages it likes to use <|eot_id|> as well. Unfortunately, the generation config doesn't say to stop generating on <|eot_id|>, so the model keeps writing.
In the Model Card, I see that there is a workaround by manually updating eos_token_id in any generate call or pipeline:
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
But I think there is a simpler way to fix this! If you just update the generation_config.json to stop on both <|end_of_text|> and <|eot_id|>, then it should work automatically and you won't need to build the terminators.
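For anyone who wants to patch a local copy in the meantime, here is a minimal sketch of the config change in plain Python. The ids 128001 and 128009 are my assumption for <|end_of_text|> and <|eot_id|> in the Llama 3 tokenizer (please verify them with tokenizer.convert_tokens_to_ids), and add_stop_token is a hypothetical helper, not something from transformers.

```python
import json

# Assumed Llama 3 token ids; verify with tokenizer.convert_tokens_to_ids(...)
END_OF_TEXT_ID = 128001  # <|end_of_text|>
EOT_ID = 128009          # <|eot_id|>


def add_stop_token(config: dict, token_id: int) -> dict:
    """Return a copy of config whose eos_token_id also includes token_id."""
    updated = dict(config)
    eos = updated.get("eos_token_id")
    # eos_token_id may be missing, a single int, or already a list of ints
    if eos is None:
        eos_ids = []
    elif isinstance(eos, int):
        eos_ids = [eos]
    else:
        eos_ids = list(eos)
    if token_id not in eos_ids:
        eos_ids.append(token_id)
    updated["eos_token_id"] = eos_ids
    return updated


# Hypothetical "before" config with a single EOS id
before = {"bos_token_id": 128000, "eos_token_id": END_OF_TEXT_ID}
after = add_stop_token(before, EOT_ID)
print(json.dumps(after, indent=2))
```

To apply it to a real file you would json.load the local generation_config.json, pass the dict through the helper, and json.dump it back.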
Running into the same issue. With the default config, the model doesn't stop at <|eot_id|> and will generate new text for the user.
After updating the config, the model no longer generates user text, but instead ends with an infinite series of <|eot_id|><|start_header_id|>assistant:<|eot_id|><|start_header_id|>assistant:<|eot_id|>...
Is there a way to prevent this?
Hmm @entropy could you provide more details about your setup? Here's what is working for me, referencing this PR:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
revision = "refs/pr/4"  # load the files from this PR rather than from main
tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_path, revision=revision, device_map="auto", torch_dtype=torch.bfloat16)

prompt = "Write a haiku about terminators."
chat = [{'content': prompt, 'role': 'user'}]
# Apply the chat template and append the assistant header so the model answers
chat_tokens = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors='pt').to(model.device)
# Greedy decoding; stopping relies on the eos_token_id from the generation config
new_chat_tokens = model.generate(chat_tokens, do_sample=False, max_new_tokens=128)
# Decode the full sequence (prompt + reply), keeping special tokens visible
new_chat_str = tokenizer.decode(new_chat_tokens[0])
print(new_chat_str)
produces:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Write a haiku about terminators.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Metal hearts ablaze
Rise from ashes, cold and dark
Judgment day arrives<|eot_id|>
Same here, I use oobabooga textgen and llama 3 8B instruct will not shut up.
To reproduce, just ask it for a single token, for example tell it to say START.
It's the same with TabbyAPI.
In oobabooga text-generation-webui, you also need to uncheck "Skip special tokens" in the Parameters -> Generation tab.
fixed gguf quant here. https://huggingface.co./QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
Check my latest message here: https://huggingface.co./meta-llama/Meta-Llama-3-8B-Instruct/discussions/14
For me, this change was not enough in text-generation-webui.
I had to uncheck "Skip special tokens" and add "<|eot_id|>" to the custom stop strings. After that, everything was good.
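To illustrate why both settings seem to matter: a stop string can only match if it actually appears in the decoded text, which presumably is not the case while special tokens are being skipped. Below is a rough sketch of what stop-string truncation does in general. It is not text-generation-webui's actual implementation, and truncate_at_stop is a hypothetical helper.

```python
def truncate_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        if not stop:
            continue  # an empty stop string would match at index 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Decoded output with special tokens left in, so the stop string is visible
raw = "START<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nSTART"
print(truncate_at_stop(raw, ["<|eot_id|>"]))
```

If the special tokens were stripped before this check, the text would contain no "<|eot_id|>" and nothing would be cut.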
fixed gguf quant here. https://huggingface.co./QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
Yes, and it works fine. I use Meta-Llama-3-8B-Instruct.Q8_0.gguf and Meta-Llama-3-8B-Instruct.Q6_K.gguf, and both stop the conversation perfectly when finished.
Many thanks. :)
hi guys, is my issue related to the same problem described here? https://huggingface.co./meta-llama/Meta-Llama-3-8B-Instruct/discussions/36
if yes, will this repo be fixed?
Please change
new_chat_str = tokenizer.decode(new_chat_tokens[0])
to
new_chat_str = tokenizer.decode(new_chat_tokens[0], skip_special_tokens=True)
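One related note: as far as I know, generate on a decoder-only model returns the prompt tokens followed by the new ones, so even with skip_special_tokens=True the decoded string would still contain the user message. A small sketch of slicing out only the reply ids before decoding is below. It uses toy token ids apart from 128009, which I am assuming is <|eot_id|>, and reply_token_ids is a hypothetical helper.

```python
def reply_token_ids(output_ids, prompt_len, stop_ids):
    """Return the generated ids after the prompt, up to the first stop id."""
    reply = []
    for tok in output_ids[prompt_len:]:
        if tok in stop_ids:
            break  # drop the terminator and anything after it
        reply.append(tok)
    return reply


# Toy ids for illustration only; 128009 is assumed to be <|eot_id|>
prompt_ids = [1, 2, 3]
output_ids = prompt_ids + [10, 11, 12, 128009, 99]
print(reply_token_ids(output_ids, len(prompt_ids), {128001, 128009}))
```

With real tensors the equivalent would be decoding new_chat_tokens[0][chat_tokens.shape[-1]:] instead of the full sequence.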