System prompts ignored in chat completions
From https://huggingface.co./microsoft/Phi-3-mini-4k-instruct-gguf/discussions/11 :
As of the most recent upload, the published quants list the chat template as:
{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '
' + message['content'] + '<|end|>' + '
' + '<|assistant|>' + '
'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '
'}}{% endif %}{% endfor %}
...which has the net result of ignoring any system prompt passed in.
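To make the effect concrete, here is a minimal pure-Python sketch that mirrors the branching of the template quoted above. The function name `render_quant_template` and the `<s>` value for `bos_token` are assumptions for illustration only; the real BOS string comes from the tokenizer.

```python
def render_quant_template(messages, bos_token="<s>"):
    """Pure-Python mirror of the Jinja chat template quoted above.

    bos_token="<s>" is an assumed placeholder, not taken from the thread.
    """
    parts = [bos_token]
    for message in messages:
        if message["role"] == "user":
            # User turns are wrapped and followed by an assistant header.
            parts.append("<|user|>\n" + message["content"] + "<|end|>\n<|assistant|>\n")
        elif message["role"] == "assistant":
            parts.append(message["content"] + "<|end|>\n")
        # Any other role, including "system", matches no branch and is skipped.
    return "".join(parts)


messages = [
    {"role": "system", "content": "Answer in French."},
    {"role": "user", "content": "Hello"},
]
prompt = render_quant_template(messages)
print(repr(prompt))  # prints '<s><|user|>\nHello<|end|>\n<|assistant|>\n'
```

The system message never reaches the rendered prompt, and no error or warning signals that it was dropped.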
The breaking change is commit 300945e90b6f55d3cb88261c8e5333fae696f672.
I also have this problem!
The model has not been optimized for the system instruction and produces better generations without it.
That’s why we opted to remove any reference to system altogether. Try appending it to your first user prompt; it should work better than a separate system instruction.
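One way to apply this suggestion on the client side is to rewrite the message list before it reaches the template. The sketch below is only an illustration: the helper name `fold_system_into_first_user`, the choice to place the system text before the user text, and the blank-line separator are all assumptions, not something specified in this thread.

```python
def fold_system_into_first_user(messages, separator="\n\n"):
    """Return a new message list with system content merged into the first user turn.

    Placement (system text first) and the separator are assumptions.
    """
    system_text = separator.join(
        m["content"] for m in messages if m["role"] == "system"
    )
    folded = []
    merged = False
    for m in messages:
        if m["role"] == "system":
            continue  # system turns are dropped after their text is collected
        if m["role"] == "user" and system_text and not merged:
            folded.append(
                {"role": "user", "content": system_text + separator + m["content"]}
            )
            merged = True
        else:
            folded.append(dict(m))
    if system_text and not merged:
        # No user turn to merge into: carry the system text as its own user turn.
        folded.insert(0, {"role": "user", "content": system_text})
    return folded


chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Bye"},
]
folded_chat = fold_system_into_first_user(chat)
```

The result contains only `user` and `assistant` roles, so a template that ignores `system` still sees the instruction text.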
Perhaps a discussion rather than simply closing the issues is in order.
Why do you feel that ignoring parameters from the user is better than conforming to the API contract? Would revising the template to treat the system prompt as an additional user prompt not achieve the goal you set out in the thread on the GGUF repo?
I second this^
The user has an expectation that system prompts will be used if they are included in a given dataset. I’d prefer an approach like the one outlined above for GGUF; or, if you’re going to break this contract completely, it should be widely publicized on the model card.
@jrc is correct; we don't have the ability to re-open closed discussions.
In my application, I've used the "microsoft/Phi-3" as a magic string to change behaviour - I place the system prompt in a <|user|> block before the rest of the conversation. It seems to work acceptably, and would be implementable in the Jinja template with a swap out of:
{% if (message['role'] == 'user') %}
with
{% if (message['role'] == 'user' or message['role'] == 'system') %}
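For readers who want to see what that one-line swap would produce, here is a pure-Python sketch of the modified branching. The function name `render_with_system_as_user` and the `<s>` BOS value are assumptions for illustration.

```python
def render_with_system_as_user(messages, bos_token="<s>"):
    """Mirror of the template with the proposed swap: system is treated like user.

    bos_token="<s>" is an assumed placeholder.
    """
    parts = [bos_token]
    for message in messages:
        if message["role"] in ("user", "system"):
            parts.append("<|user|>\n" + message["content"] + "<|end|>\n<|assistant|>\n")
        elif message["role"] == "assistant":
            parts.append(message["content"] + "<|end|>\n")
    return "".join(parts)


rendered = render_with_system_as_user([
    {"role": "system", "content": "Answer in French."},
    {"role": "user", "content": "Hello"},
])
print(rendered)
```

Because the user branch of the template always appends an `<|assistant|>` header, the swap as written also emits an empty assistant header right after the system block. Whether that matters for generation quality is not established in this thread.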
Oh god, 100% my bad then, I thought everyone was able to re-open a discussion. Well, now that I know this, I will stop closing them lol
We are doing some ablations between including system as an additional <|user|> conversation and prepending the prompt on the first <|user|> conversation.
Will let you know the results soon!
I'm hopping from foot to foot as well. Would love to remove this model-specific hack from my inference app.
Hi, any update to this issue? Thanks in advance
Hi @gugarosa (or someone from the HF / Microsoft team),
Pinging this thread again - I'm a maintainer on torchtune, where we've included some versions of the Phi-3 model for users to finetune. Currently we include the system prompt, as this is what the paper and original model did, but obviously this means that our users will not get the same results as users of Hugging Face's SFT Trainer. This has therefore been a source of confusion and silent errors.
It would be helpful to have an official recommendation - preferably with the aforementioned ablation results - on how we should handle the system prompt.
Thanks!
Thank you all for your feedback! We recently updated the model, which now allows the system prompt. We would love to continue receiving your comments and suggestions.
Thanks @nguyenbh ! Can you share general conclusions of the ablation per @gugarosa 's comment? In general, should we be using the system prompt?
It's not as good as a proper ablation study, but I did an experiment on a single dataset exploring some of the questions in this thread.
I am doing LoRA fine-tuning with torchtune. My dataset has input/output pairs. I also have a prompt and few-shot examples. For example:
Let's say my training samples are like:
input, output
Input 1, Output 1
Input 2, Output 2
And my few-shot examples are like:
input, output
Example Input 1, Example Output 1
Example Input 2, Example Output 2
And I have the prompt, "My awesome prompt."
In the image below, you'll see LoRA loss curves on the training set with the following color code:
- Red: No prompt or few-shot examples, just input/output pairs
  - I.e., the model is trained on a string like:
    <|user|>Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
- Blue: The prompt and few-shot examples smushed into the training example's <|user|> input
  - Strings like:
    <|user|>My awesome prompt. Example Input 1\n Example Output 1\n ... Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
- Green: Proper adherence to the template, with the prompt in <|system|>, the few-shot examples in <|user|>...<|assistant|> pairs, and then a final <|user|>/<|assistant|> pair for the training example.
  - Strings like:
    <|system|>My awesome prompt.<|end|>\n<|user|>Example Input 1<|end|>\n<|assistant|>Example Output 1<|end|>\n...Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
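The three formats can be sketched as small string builders. This is an illustration of the descriptions above, not the code used in the experiment: the function names are made up, the end-of-turn token is written as `<|end|>`, and the exact separators between the prompt and the few-shot examples in the Blue variant are an assumption based on the sample string.

```python
END = "<|end|>"
EOT = "<|endoftext|>"


def build_red(inp, out):
    """Red: a bare input/output pair, no prompt or few-shot examples."""
    return f"<|user|>{inp}{END}\n<|assistant|>{out}{END}\n{EOT}"


def build_blue(prompt, shots, inp, out):
    """Blue: prompt and few-shot examples folded into a single user turn.

    The "\\n " separators between examples are an assumption.
    """
    shot_text = "".join(f"{shot_in}\n {shot_out}\n " for shot_in, shot_out in shots)
    return f"<|user|>{prompt} {shot_text}{inp}{END}\n<|assistant|>{out}{END}\n{EOT}"


def build_green(prompt, shots, inp, out):
    """Green: system turn, then one user/assistant pair per example and sample."""
    turns = [f"<|system|>{prompt}{END}\n"]
    for turn_in, turn_out in list(shots) + [(inp, out)]:
        turns.append(f"<|user|>{turn_in}{END}\n<|assistant|>{turn_out}{END}\n")
    return "".join(turns) + EOT


shots = [("Example Input 1", "Example Output 1")]
red = build_red("Input 1", "Output 1")
blue = build_blue("My awesome prompt.", shots, "Input 1", "Output 1")
green = build_green("My awesome prompt.", shots, "Input 1", "Output 1")
```

With a single few-shot example, `green` contains one `<|system|>` turn and two `<|user|>`/`<|assistant|>` pairs, while `blue` contains exactly one `<|user|>` turn.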
My conclusion from the results is that if you're going to do fine-tuning, it doesn't really matter if you smush it all into the first user input, or use the recommended template. Note that the green curve continues the same number of steps as the other curves, but is invisible beyond a certain point because it becomes indistinguishable from the blue in this viz.
Thanks for running an experiment @WelcomeAIOverlords , huge help! Do you also have results w/ validation loss and accuracy?