Unnecessary extra space in completion
I am trying to make Mistral complete the function `quacksort` by gaslighting it into thinking that it already started its answer with `def quacksort(ducks)`, but it keeps adding an extra space after the `)`. I am using `llama-server`'s raw `/completion` API (llama.cpp version 4628, Q5_K_M quantization).
import requests

url = "http://127.0.0.1:8080/completion"
data = {
    "prompt": "[SYSTEM_PROMPT] You write code.[/SYSTEM_PROMPT][INST] Implement quicksort[/INST] ```def quacksort(ducks)",
    "temperature": 0.0, "seed": 0, "n_predict": 16,
}
with requests.post(url, json=data) as r:
    print(repr(r.json()["content"]))
The output is `' -> list:\n if len(ducks) <= 1:\n return'`, which means that whatever was fed to the model made it expect an extra space next.
Other observations:

- It continues correctly when removing the last `)`.
- Adding `<s>` at the start of the prompt made no difference.
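For what it's worth, the token boundary at the end of the prompt can be inspected with llama-server's `/tokenize` and `/detokenize` endpoints. A quick sketch, assuming default server settings:

```python
import requests

base = "http://127.0.0.1:8080"

# Compare how the tail of the prompt tokenizes with and without the final ")".
for text in ("```def quacksort(ducks)", "```def quacksort(ducks"):
    tokens = requests.post(
        f"{base}/tokenize", json={"content": text, "add_special": False}
    ).json()["tokens"]
    # Detokenize just the last token to see where the boundary ended up.
    last = requests.post(f"{base}/detokenize", json={"tokens": tokens[-1:]}).json()["content"]
    print(repr(text), "->", len(tokens), "tokens, last token text:", repr(last))
```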
I have heard that tokenization bugs are common, so I wanted to bring this issue to general attention. But of course it is also possible that my code is wrong.
I've had this issue with other models in the past and I'm honestly not sure where it comes from; it may be the model, the tokenizer, or the tool. But I feel your frustration, it's quite weird, especially when you want to use it as a code completion model 🥲
Unrelated, but there's no leading space in the Tekken7 instruct format according to Mistral.
[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT][INST]{prompt}[/INST]
OT: It's likely a tokenizer thing; the model might not have a choice.
> Unrelated, but there's no leading space in the Tekken7 instruct format according to Mistral.
I was wondering about that as well. When starting `llama-server`, I get the following information about the chat template:
main: chat template, chat_template: {{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST]' + message['content'] + '[/INST]' }}{% elif message['role'] == 'system' %}{{ '[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}{% else %}{{ raise_exception('Only user, system and assistant roles are supported!') }}{% endif %}{% endfor %}, example_format: '[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST] Hi there</s>[INST] How are you?[/INST]'
The prompt template looks like it does not have spaces:
'[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]'
But the example has spaces added:
'[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST] Hi there</s>[INST] How are you?[/INST]'
When asking `llama-server` to apply the template, the spaces are added as well:
import requests

url = "http://127.0.0.1:8080/apply-template"
data = {
    "messages": [
        {"role": "system", "content": "You write code."},
        {"role": "user", "content": "Implement quicksort"},
    ],
}
with requests.post(url, json=data) as r:
    print(r.json()["prompt"])
Output:
[SYSTEM_PROMPT] You write code.[/SYSTEM_PROMPT][INST] Implement quicksort[/INST]
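To double-check where the spaces come from, the Jinja template from the server log can also be rendered directly with `jinja2`. A minimal sketch, assuming `<s>`/`</s>` as the BOS/EOS tokens:

```python
from jinja2 import Environment

# Chat template string as reported by llama-server at startup.
template_str = (
    "{{ bos_token }}{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ '[INST]' + message['content'] + '[/INST]' }}"
    "{% elif message['role'] == 'system' %}{{ '[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]' }}"
    "{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}"
    "{% else %}{{ raise_exception('Only user, system and assistant roles are supported!') }}"
    "{% endif %}{% endfor %}"
)

def raise_exception(message):
    raise ValueError(message)

env = Environment()
env.globals["raise_exception"] = raise_exception

rendered = env.from_string(template_str).render(
    bos_token="<s>",   # assumption: Mistral-style special tokens
    eos_token="</s>",
    messages=[
        {"role": "system", "content": "You write code."},
        {"role": "user", "content": "Implement quicksort"},
    ],
)
print(repr(rendered))
# '<s>[SYSTEM_PROMPT]You write code.[/SYSTEM_PROMPT][INST]Implement quicksort[/INST]'
```

Rendered this way there are no spaces after the special tokens, which suggests the extra spaces are added somewhere inside llama-server rather than by the template itself.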
Another idea: It might also be the "prompt boundary problem" which can be fixed with something called "token healing":
- https://towardsdatascience.com/the-art-of-prompt-design-prompt-boundaries-and-token-healing-3b2448b0be38
- https://github.com/ggerganov/llama.cpp/issues/5765
- https://github.com/ggerganov/llama.cpp/pull/7187
The PR seems to have stalled though, and that old branch cannot load this new Mistral model anymore (`llama_model_load: error loading model: error loading model hyperparameters: invalid n_rot: 128, expected 160 llama_load_model_from_file: failed to load model`), so I cannot check whether that fixes the extra space or whether it is caused by a different issue.
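In the meantime, the idea behind token healing can be approximated by hand against the current server: tokenize the plain-text prefix, back up by one token, and use a grammar to force the dropped text to be regenerated together with the continuation. A rough, untested sketch (the header/prefix split and the grammar pattern are my own choices):

```python
import requests

base = "http://127.0.0.1:8080"
header = "[SYSTEM_PROMPT] You write code.[/SYSTEM_PROMPT][INST] Implement quicksort[/INST] "
prefix = "```def quacksort(ducks)"

# Tokenize only the plain-text prefix, then drop the last token so the
# final (possibly awkward) token boundary is chosen by the model itself.
tokens = requests.post(f"{base}/tokenize", json={"content": prefix, "add_special": False}).json()["tokens"]
kept = requests.post(f"{base}/detokenize", json={"tokens": tokens[:-1]}).json()["content"]
dropped = requests.post(f"{base}/detokenize", json={"tokens": tokens[-1:]}).json()["content"]

# Constrain generation so the dropped text is reproduced verbatim before the continuation.
data = {
    "prompt": header + kept,
    "grammar": f'root ::= "{dropped}" .*',
    "temperature": 0.0, "seed": 0, "n_predict": 16,
}
print(repr(requests.post(f"{base}/completion", json=data).json()["content"]))
```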
Uh, interesting.
I'm like 99% sure that the "without spacing" version is the correct one. At least I sure hope so, because I'm getting tired of Mistral's instruction formats.
What might be happening is more tokenizer fun. A ton of tokens start with a space: maybe there's a " Imp"(lement) token but no "Imp"(lement) one, or maybe there are both and the first gets picked over the other when the app feeds the prompt to the LLM. I'd need to double check in my own app, where I can see what happens token by token. It's just not really how I want to occupy my weekend :D
Edit: Didn't read your second post while writing the above. Interesting as well, thanks. I'll look at it when I get some time to check the model in more detail.
> It's just not really how I want to occupy my weekend :D
You don't have to. You don't owe the world anything :)
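For what it's worth, the space-prefixed vs. bare token question can be checked quickly against llama-server's `/tokenize` endpoint without a custom app; a small sketch, assuming default settings:

```python
import requests

url = "http://127.0.0.1:8080/tokenize"

# Compare the token ids with and without a leading space.
for text in ("Implement", " Implement"):
    tokens = requests.post(url, json={"content": text, "add_special": False}).json()["tokens"]
    print(repr(text), "->", tokens)
```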
Anyway, I am still not sure whether the spaces in the chat template are correct, but at least I found a workaround which does not generate unnecessary spaces.
It uses llama.cpp's GBNF grammar support to force the completion to start with the prefix characters.
import requests

url = "http://127.0.0.1:8080/upstream/mistral/v1/chat/completions"
prefix = "```def quacksort(ducks)"
data = {
    "messages": [
        {"role": "system", "content": "You write code."},
        {"role": "user", "content": "Implement quicksort"},
    ],
    "grammar": f'root ::= "{prefix}" .*',
    "stop": ["\n```", "# Example"],
    "temperature": 0.0, "seed": 0,
}
with requests.post(url, json=data) as r:
    answer = r.json()["choices"][0]["message"]["content"].removeprefix(prefix)

color_cyan = "\x1b[36m"
color_green = "\x1b[32m"
reset_color = "\x1b[0m"
print(color_green + prefix + color_cyan + answer + reset_color)
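One caveat with building the grammar via an f-string: if the prefix ever contains a double quote or a backslash, the GBNF string literal breaks. A tiny hypothetical helper to escape it first:

```python
def gbnf_literal(text: str) -> str:
    # Escape backslashes and double quotes so the text forms a valid GBNF string literal.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

prefix = 'def quacksort(ducks: list["Duck"])'  # example with characters that need escaping
grammar = f"root ::= {gbnf_literal(prefix)} .*"
print(grammar)
```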
It's more for my own peace of mind than the world's, to be honest. And because I'll release a front-end client soon-ish, I'm always interested in looking at this kind of stuff, since I'll have to find workarounds for it at some point. :)
But, indeed, what we discussed, alongside the original topic, is most likely due to the boundary problem. I forgot it was called that.
Using a grammar is a decent workaround for this specific case, indeed. I'm just expecting a slightly different variant of the underlying 'bug' to pop up later down the line, though.
I feel like this is a Mistral problem; I remember the 7B having the same issue. Maybe it's their tokenizer, but I think token healing is probably the most "correct" solution (outside of just fixing the tokenizer itself).
Also, the 7B issue showed up on both llama.cpp and exl2, so it wasn't just a llama.cpp thing.
Grammar is a clever idea though, for sure.