System prompts ignored in chat completions

#51
by joshuaturner - opened

From https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/discussions/11 :

As of the most recent upload, the published quants list the chat template as:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '
' + message['content'] + '<|end|>' + '
' + '<|assistant|>' + '
'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '
'}}{% endif %}{% endfor %}

...which has the net result of ignoring any system prompt passed in.

The breaking change is commit 300945e90b6f55d3cb88261c8e5333fae696f672.
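
To make the effect concrete, here is a minimal sketch that restates the template quoted above in plain Python instead of running Jinja. The function name and the `<s>` value for `bos_token` are assumptions for illustration; the branching mirrors the template's `if` / `elif` exactly.

```python
from typing import Dict, List

Message = Dict[str, str]


def render_phi3_no_system(messages: List[Message], bos_token: str = "<s>") -> str:
    """Plain-Python restatement of the quoted chat template (hypothetical helper)."""
    parts = [bos_token]
    for message in messages:
        if message["role"] == "user":
            # User turn, followed immediately by the assistant header.
            parts.append("<|user|>\n" + message["content"] + "<|end|>\n<|assistant|>\n")
        elif message["role"] == "assistant":
            parts.append(message["content"] + "<|end|>\n")
        # Any other role, including "system", matches no branch and emits nothing.
    return "".join(parts)


messages = [
    {"role": "system", "content": "Answer only in French."},
    {"role": "user", "content": "Hello!"},
]
prompt = render_phi3_no_system(messages)
print(prompt)
```

The system message never reaches the rendered prompt, and nothing warns the caller that it was dropped.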

I also have this problem!

Microsoft org

The model has not been optimized for the system instruction and produces better generations without it.

That’s why we opted to remove any reference to system altogether. Try appending it to your first user prompt; that should work better than a separate system instruction.
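
A rough sketch of that suggestion, applied on the client side before the messages reach the template. The helper name, the separator, and the choice to put the system text ahead of the user text are all assumptions; the comment above does not specify them.

```python
from typing import Dict, List

Message = Dict[str, str]


def fold_system_into_first_user(
    messages: List[Message], separator: str = "\n\n"
) -> List[Message]:
    """Drop system messages and merge their text into the first user message."""
    system_text = separator.join(
        m["content"] for m in messages if m["role"] == "system"
    )
    folded: List[Message] = []
    merged = False
    for m in messages:
        if m["role"] == "system":
            continue  # handled via system_text
        if m["role"] == "user" and system_text and not merged:
            # Assumption: system text goes first, then the original user text.
            folded.append(
                {"role": "user", "content": system_text + separator + m["content"]}
            )
            merged = True
        else:
            folded.append(dict(m))
    return folded


messages = [
    {"role": "system", "content": "Answer only in French."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Bonjour !"},
    {"role": "user", "content": "How are you?"},
]
folded = fold_system_into_first_user(messages)
print(folded)
```

Only the first user message is changed; later turns pass through untouched, and a conversation with no system message comes back as a plain copy.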

gugarosa changed discussion status to closed

Perhaps a discussion rather than simply closing the issues is in order.

Why do you feel that ignoring parameters from the user is better than conforming to the API contract? Would revising the template to treat the system prompt as an additional user prompt not achieve the goal you set out in the thread on the GGUF repo?

I second this^

The user has an expectation that system prompts will be used if they are included in a given dataset. I’d prefer an approach like the one outlined above for GGUF, or, if you’re going to break this contract completely, it should be widely publicized on the model card.

Microsoft org
This comment has been hidden
gugarosa changed discussion status to open

@gugarosa Thanks for the follow-up - very eager to hear the report from the MSFT team responsible for finetuning of Phi-3.

(FYI, I believe only repository admins are able to re-open closed Discussions)

@jrc is correct; we don't have the ability to re-open closed discussions.

In my application, I've used the "microsoft/Phi-3" as a magic string to change behaviour - I place the system prompt in a <|user|> block before the rest of the conversation. It seems to work acceptably, and would be implementable in the Jinja template with a swap out of:

{% if (message['role'] == 'user') %}

with

{% if (message['role'] == 'user' or message['role'] == 'system') %}
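
For illustration, the same swap restated in plain Python rather than Jinja. The function name and the `<s>` value for `bos_token` are assumptions; the only change from the published template is the widened condition on the first branch.

```python
from typing import Dict, List

Message = Dict[str, str]


def render_system_as_user(messages: List[Message], bos_token: str = "<s>") -> str:
    """Plain-Python restatement of the template with the proposed swap (hypothetical helper)."""
    parts = [bos_token]
    for message in messages:
        if message["role"] == "user" or message["role"] == "system":
            # System text is now wrapped exactly like a user turn.
            parts.append("<|user|>\n" + message["content"] + "<|end|>\n<|assistant|>\n")
        elif message["role"] == "assistant":
            parts.append(message["content"] + "<|end|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Answer only in French."},
    {"role": "user", "content": "Hello!"},
]
prompt = render_system_as_user(messages)
print(prompt)
```

Taken literally, the swap also emits the `<|assistant|>` header right after the system text, so the rendered prompt contains an empty assistant turn before the real user message. Whether that matters in practice is not something this sketch can tell you.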


@jrc is correct; we don't have the ability to re-open closed discussions.

Oh god, 100% my bad then, I thought everyone was able to re-open a discussion. Well, now that I know this, I will stop closing them lol

Microsoft org

We are doing some ablations between including system as an additional <|user|> conversation and prepending the prompt on the first <|user|> conversation.

Will let you know soon the results!

Following up on this @gugarosa - any results to share?

I'm hopping from foot to foot as well. Would love to remove this model-specific hack from my inference app.

Hi, any update to this issue? Thanks in advance

Hi @gugarosa (or someone from the HF / Microsoft team),

Pinging this thread again - I'm a maintainer on torchtune, where we've included some versions of the Phi-3 model for users to finetune. Currently we include the system prompt, as this is what the paper and original model did, but this means that our users will not get the same results as users of Hugging Face's SFT Trainer. This has been a point of confusion and a source of silent errors.

It would be helpful to have an official recommendation - preferably with the aforementioned ablation results - on how we should handle the system prompt.

Thanks!

Microsoft org

Thank you all for your feedback! We recently updated the model, which now allows the system prompt. We would love to continue receiving your comments and suggestions.

Thanks @nguyenbh! Can you share general conclusions of the ablation per @gugarosa's comment? In general, should we be using the system prompt?

We are doing some ablations between including system as an additional <|user|> conversation and prepending the prompt on the first <|user|> conversation.

Will let you know soon the results!

Microsoft org

@aladar With the latest update (June 2024), you can use the system prompt. The example in the model card can be a starting point.

Hello @nguyenbh and thank you for adding support for the system prompt. Do you know if this change will be propagated to the larger-context variants of Phi-3 (the 128k mini and medium)? Currently the change does not appear to be there yet.

Microsoft org

@aieat Thank you for your interest in Phi-3 model family.
The change is propagated to Mini-128K. Other models have no update.

nguyenbh changed discussion status to closed

It's not as good as a proper ablation study, but I did an experiment on a single dataset exploring some of the questions in this thread.

I am doing LoRA fine-tuning with torchtune. My dataset has input/output pairs, and I also have a prompt and few-shot examples. For example, let's say my training samples look like:

input, output
Input 1, Output 1
Input 2, Output 2

And my few-shot examples are like:

input, output
Example Input 1, Example Output 1
Example Input 2, Example Output 2

And I have the prompt, "My awesome prompt."

In the image below, you'll see LoRA loss curves on the training set with the following color code:

  • Red: No prompt or few-shot examples, just input/output pairs
    • I.e., the model is trained on a string like: <|user|>Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
  • Blue: The prompt and few-shot examples smushed into the training example's <|user|> input
    • Strings like: <|user|>My awesome prompt. Example Input 1\n Example Output 1\n ... Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
  • Green: Proper adherence to the template, with the prompt in <|system|>, the few-shot examples in <|user|>...<|assistant|> pairs, and then a final <|user|> / <|assistant|> pair for the training example.
    • Strings like: <|system|>My awesome prompt.<|end|>\n<|user|>Example Input 1<|end|>\n<|assistant|>Example Output 1<|end|>\n...<|user|>Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
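
The three string layouts above can be sketched roughly as follows. This is not the actual training code: the function names are made up for illustration, the end marker is written as `<|end|>` throughout, and the way the blue variant joins the few-shot examples (newline plus space) is an assumption read off the example string.

```python
from typing import List, Tuple

END = "<|end|>\n"
EOT = "<|endoftext|>"
Shots = List[Tuple[str, str]]


def turn(role: str, text: str) -> str:
    """One template turn: role tag, text, end marker."""
    return f"<|{role}|>{text}{END}"


def red(inp: str, out: str) -> str:
    """No prompt or few-shot examples, just the input/output pair."""
    return turn("user", inp) + turn("assistant", out) + EOT


def blue(prompt: str, shots: Shots, inp: str, out: str) -> str:
    """Prompt and few-shot examples packed into the single user turn."""
    # Assumption: examples are joined as "input\n output\n " as in the sample string.
    shot_text = "".join(f"{i}\n {o}\n " for i, o in shots)
    return turn("user", f"{prompt} {shot_text}{inp}") + turn("assistant", out) + EOT


def green(prompt: str, shots: Shots, inp: str, out: str) -> str:
    """Prompt in a system turn, each example as its own user/assistant pair."""
    text = turn("system", prompt)
    for i, o in shots:
        text += turn("user", i) + turn("assistant", o)
    return text + turn("user", inp) + turn("assistant", out) + EOT


shots = [("Example Input 1", "Example Output 1"), ("Example Input 2", "Example Output 2")]
for build in (
    lambda: red("Input 1", "Output 1"),
    lambda: blue("My awesome prompt.", shots, "Input 1", "Output 1"),
    lambda: green("My awesome prompt.", shots, "Input 1", "Output 1"),
):
    print(repr(build()))
```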

[image.png: LoRA training loss curves for the red, blue, and green setups]

My conclusion from the results is that if you're going to do fine-tuning, it doesn't really matter if you smush it all into the first user input, or use the recommended template. Note that the green curve continues the same number of steps as the other curves, but is invisible beyond a certain point because it becomes indistinguishable from the blue in this viz.

Thanks for running an experiment @WelcomeAIOverlords , huge help! Do you also have results w/ validation loss and accuracy?
