Please consider changing the instruct format!

#10
by MarinaraSpaghetti - opened

Dear MistralAI Team! Just to be completely transparent: I absolutely love your models and your work; you guys are the best! However, I think I'm speaking on behalf of the entire community when I say: please, PLEASE consider ditching the gods-awful Mistral Instruct format!

It's abysmal to work with, especially for the folks who use a Text Completion API to run the models (which is what Oobabooga's WebUI or Koboldcpp offer). It's simply confusing to set up and very rigid, especially if someone is working with dynamic context insertions (lore book entries added at different depths, for example). People are also unsure what the correct format for each new model looks like; even legends like Bartowski himself got it mixed up. Hell, I myself had to double-check the NeMo one with one of your team members, only to learn that I'd also been using it incorrectly!

Not to mention, the lack of a proper system prompt is actively harming the model's possible capabilities and its reasoning, especially at higher contexts. We need a clear distinction between the different roles: system, user, and assistant. The format should recognize the system prompt as the main instruction to be followed at all times; it's how most, if not all, other formats handle this. Take ChatML, for example. Because it just works.

Even adding a simple [SYSTEM]/[/SYSTEM] tag would help IMMENSELY. Or leaving [INST]/[/INST] for the system, while adding [USER]/[/USER] for the user. Here is my ideal example, using Mistral Small's format:

<s>
[SYSTEM]
{{System Prompt}}
[/SYSTEM]
[INST]
{{User's Message}}
[/INST]
{{Assistant's Response}}
</s>

Please, I beg you on my knees, with tears, snorts, and all leaking down my face: at least consider this possibility. You will make our lives easier and your models even better than they already are. Thank you.

Mistral AI_ org

Hi @MarinaraSpaghetti ,
We have actually written a document in the cookbook repo delving into this; it explains the slight differences between each tokenizer and also what should be used as ground truth, in detail: https://github.com/mistralai/cookbook/blob/main/concept-deep-dive/tokenization/chat_templates.md . We will increase the amount of documentation regarding the tokenizers and chat templates to help out! Hopefully this document will answer most of your doubts!

Hey @pandora-s , thank you, the document is super useful! I've been there when it was being conceived, and I'm still super grateful to you for creating it and sending it over on Discord. :)

But sadly, this doesn't answer the main concern I've raised: the lack of a proper system/user/assistant distinction in the Mistral format itself. Due to that, the model forgets its initial instructions and doesn't work as well as intended in multi-turn conversations, especially at higher contexts, like I previously mentioned. I know you cannot confirm whether you'll be changing the format or not in the future. All I'm asking is that you at least consider my request and re-evaluate how the format could potentially look, taking into consideration what other successful models are using. It would be amazing, truly.

Once again, thank you for all the hard work and for the incredible job you're doing!

Hi @MarinaraSpaghetti ,
We have actually written a document in the cookbook repo delving into this; it explains the slight differences between each tokenizer and also what should be used as ground truth, in detail: https://github.com/mistralai/cookbook/blob/main/concept-deep-dive/tokenization/chat_templates.md . We will increase the amount of documentation regarding the tokenizers and chat templates to help out! Hopefully this document will answer most of your doubts!

Wow, thanks for the boilerplate response! /s

Here's the problem with it: it's not standard, and I don't see a reason for it to have to use its own format. Why not use ChatML, like @MarinaraSpaghetti suggested? ChatML is at least recognizable by most frontends, and the instruct format itself doesn't make sense.

[INST] {{User's Message}} [/INST] {{Assistant's Response}} </s>

Okay? Why do you need to use INST twice? It doesn't denote which is which beyond one being {{user}} and the other being {{assistant}}; one little screw-up and you've confused yourself into thinking it's broken. Said cookbook doesn't explain any advantage or use beyond the prompt itself. I've noticed a similar problem via Mistral's API, where this prompt would ruin interactions because it itself doesn't know who is who.

This isn't me trying to hate said models or your business; it is simply my own opinion on the faults of this format. It isn't personal.

I completely agree that the current format of Mistral is very inflexible. Moreover, when something like two messages in a row from the user happens, or two or more responses from the assistant are needed, you really have to go out of your way so that the LLM understands everything correctly, and it turns into either a very cumbersome construction of prefixes and suffixes or some kind of separate layer that reformats everything into the Mistral format. This is all incredibly inconvenient and not intuitive.
I really love Mistral models, but it is also so painful to use the current Mistral instruct format.

Mistral AI_ org

where this prompt would ruin interactions because it itself doesn't know who is who.

@ProdeusUnity This sounds like a misunderstanding. As explained in the document, the strings are only representations; the model never sees the string, it directly sees a token ID that is not correlated to any string. It's the same as with the BOS and EOS tokens: they are not actually <s> nor </s>, these are simply representations. It doesn't matter if it's [INST] or <user> (this would only matter for the tokenizer v1); the model never sees the strings since they became control tokens instead, it only sees an ID specially dedicated to them. This also avoids any possibility of prompt injection, making it extra safe!

We could in theory change the documentation and rename the control token to <user> without touching the model and have it keep the exact same behavior as normal; the string equivalents of control tokens are simply representations for users and developers.
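
To illustrate (a minimal sketch, assuming mistral_common is installed; the exact IDs depend on the tokenizer version), you can inspect what the model actually receives and see that the text is only a debug rendering of the token IDs:

from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.from_model("mistral-nemo")

tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="hello")], model="mistral-nemo")
)

# The model only ever sees these integer IDs; [INST] and [/INST] are single
# control-token IDs, not pieces of tokenized text.
print(tokenized.tokens)

# This string is just a human-readable representation of those IDs.
print(tokenized.text)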

Could you share more about this issue you mention? Couldn't it possibly be an issue with the system prompts, like @MarinaraSpaghetti mentioned previously?

Mistral AI_ org

I completely agree that the current format of Mistral is very inflexible. Moreover, when something like two messages in a row from the user happens, or two or more responses from the assistant are needed, you really have to go out of your way so that the LLM understands everything correctly, and it turns into either a very cumbersome construction of prefixes and suffixes or some kind of separate layer that reformats everything into the Mistral format. This is all incredibly inconvenient and not intuitive.
I really love Mistral models, but it is also so painful to use the current Mistral instruct format.

Thanks for the feedback!

+1, will echo the sentiment here. I love the mistralai models, but fine-tuning is always hell because of the formats, the edge cases, the extreme sensitivity, the whitespace, etc. It would be amazing if it followed a simple, easy-to-read, human-readable format :)

We could in theory change the documentation and rename the control token to <user> without touching the model and have it keep the exact same behavior as normal; the string equivalents of control tokens are simply representations for users and developers.

This would be a good starting point (I didn't know you could do that!). Even developers struggle with all the formats and their quirks/edge cases + all the libraries and their own layers.

Even developers struggle with all the formats and their quirks/edge cases + all the libraries and their own layers.

Hi, developer here.

I'm practically in love with Mistral as a company and the performance of the Mistral models is consistently, delightfully very impressive... on the API.

However I have never quite been able to 100% match the performance I'm seeing on the API with the local GGUFs. Every time I think I have the prompt format finally perfectly figured out, a new model comes out with a very slightly tweaked tokenizer, or there's a new post on r/LocalLLaMA detailing the new, finally-figured-out prompt format, whether to add whitespace or not, a new off-label way to wrangle the model into obeying a system prompt... it's so frustrating because I can see that there is such potential and there's just some little prompt format weirdness holding it back.

If it was only me having this issue I would think I was stupid and that would be the end of it. But it's clearly not.

I would suggest something like this:

[SYSTEM]
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
[/SYSTEM]
[USER]
hello
[/USER]
[BOT]
Hello! How can I assist you today? Let me know if you have any questions, need some advice, or just want to chat. I'm here to help! 😊
[/BOT]
[USER]
thank you
[/USER]

etc.

I really hope I'm not sounding ungrateful, I truly appreciate the contributions Mistral's made to the community. I just think the current format is really holding the models back.

@pandora-s ❤️🤗

+1 on one of the best models having, sadly, one of the worst instruct formats I've had the displeasure of dealing with. It's VERY inflexible; the "add a space before and after, oh, in fact, don't anymore, well it depends on the tokenizer" has been very annoying and wholly unnecessary. The "rolling system prompt" (at the end of the last query) objectively worsens the model's output outside of extremely specific tasks, and I suspect it to be the cause of most fine-tunes being, on average, pretty brain-dead out of the box. It also conflicts with most/all methods used to reduce prompt-processing computation.
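
To make that last point concrete, here is an illustrative sketch of the rendered prompts across two turns (placeholders in braces, following the template behavior shown further down this thread); because the system prompt rolls to the last user message, the prompt prefix changes between turns and cached prompt processing can't be reused:

# Turn 1: the system prompt is folded into the (only) user message.
turn_1 = "<s>[INST]{system}\n\n{user_1}[/INST]"

# Turn 2: the system prompt moves to the *last* user message, so the prompt no
# longer shares its prefix with turn 1 and the earlier computation is invalidated.
turn_2 = "<s>[INST]{user_1}[/INST]{assistant_1}</s>[INST]{system}\n\n{user_2}[/INST]"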

I get that you want to get rid of system prompts altogether (at least that's what I'm guessing, only reason for this esoteric formatting, really), but still.

Mistral AI_ org

Hey,

Mistral employee here! Some questions & answers to better understand the context.

    1. Is anyone using mistral_common: https://github.com/mistralai/mistral-common for inference / fine-tuning? At the "Request" abstraction level, the distinction between user, system, and assistant is pretty clear, no?
      E.g. it should be quite easy to write a multi-turn message with a system prompt for mistral-nemo as follows:
from mistral_common.protocol.instruct.messages import UserMessage, SystemMessage, AssistantMessage
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Load Mistral tokenizer

model_name = "mistral-nemo"

tokenizer = MistralTokenizer.from_model(model_name)

request = ChatCompletionRequest(
        messages=[
            SystemMessage(content="You are a helpful assistant"),
            UserMessage(content="What's the weather like today in Paris"),
            AssistantMessage(content="The weather is good."),
            UserMessage(content="How is it in London?"),
        ],
        model=model_name,
)

tokenizer.encode_chat_completion(request)

If possible for you, I strongly recommend working with this level of abstraction and mistral_common, especially for fine-tuning but also for inference. It's battle-tested, and you surely won't have any incorrect formatting.

    2. I assume that many libraries you're referring to / are using instead work on the "string" format of the chat request. When looking at this, I fully see your point. The way the system message, the user message & the [INST] token are handled is confusing. E.g. when looking at the parsed output of the above sample:

<s>[INST]What's the weather like today in Paris[/INST]The weather is good.</s>[INST]You are a helpful assistant\n\nHow is it somewhere else?[/INST]

it's very confusing that the system prompt is all of a sudden moved just in front of the last user message inside the [INST] tags. Together with @pandora-s, we'll transmit your message!

My question: Why do you have to work with the text format? Why is it difficult to instead directly work with the chat completion format? (sorry if this is a stupid question, would like to better understand)

    3. Generally, the following mapping:
      chat_completion_request -> string -> tokens
      It's much more error prone than:
      chat_completion_request -> tokens

Ideally, one should never have to worry about an [INST] token - in mistral_common we never make use of the actual [INST] text token - it's just a representation of a "special token" that's used to indicate the beginning and end of a user message for the model. We always directly map the request to the tokens so that there is no possibility of whitespace errors etc. happening. If you can use the direct chat-request-to-token mapping, please do - it's much less error prone & you'll always be able to support future models out of the box.

Thanks so much for your thorough response, @patrickvonplaten !

You asked:

Why do you have to work with the text format? Why is it difficult to instead directly work with the chat completion format? (sorry if this is a stupid question, would like to better understand)

It's not a stupid question at all. I can't speak for everyone, but I can explain my personal experience as a hobbyist developer and an active member of the local inference scene for the past two years.

Basically, 95% of hobbyists/enthusiasts/amateur devs are working with llama.cpp, either directly or through a wrapper like Ollama, LM Studio, Koboldcpp, SillyTavern, llama-cpp-python, easy-llama, etc. We use these programs because they work with any model that is vaguely llama-like. One huge example is the use of GGUF quants, which is more or less the agreed-upon format for distributing models once they've been quantized. Many people don't even know how to perform quantization themselves and are only able to download GGUFs to plug-and-play with Ollama, LM Studio, etc.

For example, looking through mistral-common, I don't see any way to load a quantized model for inference on consumer hardware. Compare this to llama.cpp, which supports various quantizations from 2 bits to 8 bits, as well as being model-agnostic: I can easily switch from Mistral-Nemo to Llama 3.1 8B to Command-R+ without having to change my Python environment. This sort of functionality is crucial for developers (like me!) who are working with multiple models in the course of a day.

Not to mention the fact that performing inference with unquantized models is impractical or downright impossible for many users, who often only have access to everyday consumer hardware like an old laptop or even just an iPhone.

TLDR:

While mistral_common may be the correct way to run Mistral models, it is not the most widely accessible or user-friendly way of doing so. Most users are using llama.cpp or some derivative which allows for inference with quantized models of all shapes and sizes, so to speak.

I think the question "why don't you work directly with the chat completion format" is a bit strange. If you're working on any sort of frontend project, you will internally be storing the conversation between the user and the assistant as some sort of list of messages, each of which will have their type attached to them. That's the level of abstraction everyone is already working with, because it's what is natural for that use case and is the first idea any developer would have.

But in reality, this question is asking "why don't you work with the Mistral Python library?"

And the answer to that is... well, because I'm not using Python? I might have a frontend project written in TypeScript, and an inference backend/API server written in C++ or Rust, with Python or PyTorch or Transformers nowhere in sight. And even if I AM using Python, I might want to have control over everything and write my own code, or some fresh new way of doing things might come along in the future, long after this model and its library have been deprecated (yet I might still want/need to use it).

Furthermore, it wouldn't just be the Mistral library I'd have to work with. No offense to the Mistral team, I love what you guys are doing, but other models exist and I might want to use them at some point. This approach would require me to install the Mistral library, and the Qwen library, and the Llama library, and the Deepseek library, and the Cohere library, and the Anthropic library, and so on.

"Our prompt format is so complicated, you cannot ever hope to write a correct reimplementation of it yourself, so don't even try" is kind of proving the point raised at the very beginning of this thread. Formatting a string with four placeholder values should not be a task that requires a vendor-specific library.

On a personal level I'm also not very fond of this level of black box "just trust the library" abstraction. As a developer working with language models, I think you SHOULD look directly at the raw text and tokens being generated. I want to be able to debug why my responses suddenly got cut short. I want to see what happens if I let the model generate past the </s> token, or ban the </s> token, or if I insert a second <s> token out of nowhere. I want to experiment with the model and analyze its behavior, and HAVING to rely on a library prevents me from doing that.


By the way, regardless of the chat/text completion discourse and the prompt format, the specific choice of moving the system prompt to be part of the last user message is one that baffles me:

<s>[INST]What's the weather like today in Paris[/INST]The weather is good.</s>[INST]You are a helpful assistant\n\nHow is it somewhere else?[/INST]

I would like to think that the Mistral team has empirical evidence to back up the benefits of this approach, but... doesn't this effectively nullify the system prompt? It doesn't make the model treat it as special in any way - in fact, it has no way to tell it apart from the user's request whatsoever. Moving it around constantly also means that Message 2, to which it gave Response 2, will no longer be the same message after that point (because the system prompt will get cut out of it). The model is given a question it was never asked, then a response which it gave to a completely different question, then a system prompt out of nowhere, bleeding into another question.

<s>
[INST]What is 3 + 2?[/INST]
It's 1.</s>

[INST](SYSTEM: From this point, + and - have their previous meanings reversed.)What is 2 + 1?[/INST]
It's ...

I'm not doubting the possibility that this is actually better than doing it the obvious way, but I'm curious about the rationale behind the choice.

Hey @patrickvonplaten , I really appreciate that you'd like to learn more about the issue!

Others have already pointed it out nicely, but I'd also like to add that most of the community uses either Koboldcpp or Oobabooga's WebUI as backends, which support Text Completion mode only. Meaning that they have to build their instruct prompts manually, especially if they're also using separate frontends, such as SillyTavern, for example.

It's also worth mentioning that not everyone is tech-savvy enough to run models via Python and your libraries. The generation software programs I just mentioned are so popular because they're easy to use and don't require doing any 'scary coding stuff'. And I'm not just talking about individuals; companies or startups will use those options too, especially if they don't have any hired specialists to set it up for them. These are just extra points to consider aside from the previous replies.

Thank you!

This is how it can be done in ST and similar UIs:

  • use system prompt as a variable (also remove the unnecessary tokens)
  • make various adjustments

(Screenshot: SillyTavern template settings.)

It's simple and elegant. Now your template is clean (for Mistral-Small and its good function-calling capabilities).

If your UI is "limited", keeping the "last user prefix" as a general user prefix doesn't change the model's behavior much, because my solution relies on making it remember the system prompt despite the UI's limitations. I also didn't test it on extra-long contexts, but I believe in Mistral-Small's ability to "look for" the variable.
(Two more screenshots of the SillyTavern settings.)

A few key points:

  • A better fix would be possible IF what is shown in the above screen capture didn't happen. It's up to the people making UIs to propose a better solution (it's two lines of code, I guess, if you know where to look).
  • I note that system prompt "nesting" is a thing. I won't expand on it, but I can see how it can provide security options (you can't retrieve the system prompt with the "include names" option set to ON in ST).
  • Using the "system prompt in last instruction" method has obvious upsides, as LLMs tend to regard the last tokens as more important; that kinda helps fix the "repetitive behavior" I see people complaining about a lot, and other quirks.

/j @MarinaraSpaghetti Come on, I thought you couldn't get alienated by the ChatML lobby... I'm feeling alone now 😔


Note to the Mistral Team:

If you need some kind of community manager (as Wikipedia does, to connect editors and developers on a human level) with a good understanding of the technical side of machine learning, I'm one hour from Paris, 'looking for a job', and can be as remote as needed (cf. chronic illness; truth is, I'm the one in need of flexibility on this side). Plus, I have experience building a global community focused on opening up solutions that are normally only used by large corporations to small businesses and individuals (without being cringe, the bosses are the ones who made the boat sink; they even took my ideas for prospects once I left, Joker's laugh).

Thank you for fighting so hard for open source. We'll win.

Margaux Ammour

@Limezero summarized the issues with the instruct format perfectly. The rolling system prompt (if we can really call it a system prompt at all) works only in very specific situations, centered around one-shot exchanges with the model. The moment you have a few rounds of dialogue, it quickly shows its inherent limitations. What also baffles me is that the Nemo and Small models feel desperate for a system prompt indicator of any kind: if you add a "(SYSTEM: blah blah)" or "(CONTEXT: blah blah)" to a prompt, they'll follow those instructions absolutely religiously.

We're not all just users. Some of us are developing tool chains or full-stack apps (most of them not in Python) which have to be compatible not only with your models but with others' as well. Your format, without any kind of system prompt or delimiter for bot responses (I mean, technically, the [/INST] token is the bot response delimiter), makes things really more complicated than they should be. And that's before even considering the function-calling stuff, which to me is the biggest draw, and the least documented.

As to why we don't use mistral_common as users (and not devs): simply put, the vast majority of us tend to switch models regularly. Using model-specific software, no matter how good the model itself is (and it is very good), for something as trivial as handling an instruct format is neither realistic nor practical, especially when there is already a very active and mature open-source ecosystem. That's on top of the accessibility issues mentioned above me by @MarinaraSpaghetti .

Small and Nemo are also advertised in your own documents and websites as models you can run on your own (beefy) computer, which is only true if they are quantized to Q6, Q4, or even less. Unless you're going to handle the many quantization formats in mistral_common, it's very unlikely to gain any kind of traction on the hobbyist side. Even then, it is still a Python library, which might be fine for a front-end, but less so for a back-end.

@mammour that screenshot is incorrect. The model already adds the EOS token to its output, and your back-end, whichever it is, should handle it by itself. You really shouldn't put it in the user message prefix; you can put it in the stop sequence instead.
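
For illustration, a minimal sketch of what I mean with llama-cpp-python as the back-end (the model path and generation parameters are placeholders, and I'm assuming the back-end adds the BOS token itself, as llama-cpp-python does by default):

from llama_cpp import Llama

# Placeholder path; any Mistral GGUF quant works the same way.
llm = Llama(model_path="./Mistral-Small-Instruct-Q4_K_M.gguf", n_ctx=8192)

out = llm.create_completion(
    prompt="[INST]hello[/INST]",
    max_tokens=256,
    # Treat the EOS representation as a stop sequence instead of hard-coding
    # </s> into the next user-message prefix.
    stop=["</s>"],
)
print(out["choices"][0]["text"])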

Mistral AI_ org

Hi everyone once again! I've recently made two PRs, one for SillyTavern and one for Kobold, and I will keep interacting and exploring ways to make the templates match.
Thank you all for the feedback, this is all very useful insight. Do not hesitate to ping me if any issues around the tokenizers and chat templates arise in well-known projects; I will gladly help implement fixes.

@patrickvonplaten Using the example you posted, I removed the last user message and it errors out when trying to tokenize. How could this work for fine-tuning the format: system message >> user input >> assistant response? I looked at the repo and examples and still couldn't get it to run without an error. Same example below, but without the last user message. Models using the v3 tokenizer do support system/user/assistant chat formats.

from mistral_common.protocol.instruct.messages import UserMessage, SystemMessage, AssistantMessage
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Load Mistral tokenizer

model_name = "mistral-nemo"

tokenizer = MistralTokenizer.from_model(model_name)

request = ChatCompletionRequest(
        messages=[
            SystemMessage(content="You are a helpful assistant"),
            UserMessage(content="What's the weather like today in Paris"),
            AssistantMessage(content="The weather is good."),
        ],
        model=model_name,
)

tokenizer.encode_chat_completion(request)
>>
AttributeError: 'str' object has no attribute 'value'

Hey @nazimali ,

For fine-tuning it's indeed better to work on the "message" level, also so that masks can be created alongside the messages. We want to mask out all user messages for training.

In this case, I'd still create a ChatCompletionRequest object, then validate all messages and tools (see here: https://github.com/mistralai/mistral-finetune/blob/656df1c94c80ca9703ebc471c9f106c9b7a0bfa7/finetune/data/tokenize.py#L180), and then encode each message individually to have control over how to build the masks (see here: https://github.com/mistralai/mistral-finetune/blob/656df1c94c80ca9703ebc471c9f106c9b7a0bfa7/finetune/data/tokenize.py#L289).

Using your example above & assuming we don't do function calling:

from mistral_common.protocol.instruct.messages import UserMessage, SystemMessage, AssistantMessage
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Load Mistral tokenizer

model_name = "mistral-nemo"

tokenizer = MistralTokenizer.from_model(model_name)

request = ChatCompletionRequest(
        messages=[
            SystemMessage(content="You are a helpful assistant"),
            UserMessage(content="What's the weather like today in Paris"),
            AssistantMessage(content="The weather is good."),
        ],
        model=model_name,
)

# tokenize each message & create the corresponding mask
mask = []
tokens = []
for message in request.messages:
    if isinstance(message, UserMessage):
        new_tokens = tokenizer.instruct_tokenizer.encode_user_message(message, None, False, False)[0]
        tokens += new_tokens
        # user tokens are masked out (not trained on)
        mask += [False] * len(new_tokens)
    elif isinstance(message, AssistantMessage):
        new_tokens = tokenizer.instruct_tokenizer.encode_assistant_message(message, False)
        tokens += new_tokens
        # assistant tokens are kept (trained on)
        mask += [True] * len(new_tokens)
    # (the SystemMessage is skipped in this simplified example)

While this now becomes more complex, we can still be sure that we don't have any whitespace / silent tokenization errors, since we map messages directly to tokens (and don't go over the text format).
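
As a follow-up sketch building on the tokens and mask variables above (not part of mistral_common; it assumes PyTorch and the usual causal-LM convention of ignoring masked positions with -100):

import torch

input_ids = torch.tensor(tokens)
# Only assistant tokens contribute to the loss; user tokens are ignored
# (-100 is the default ignore_index of torch.nn.CrossEntropyLoss).
labels = torch.tensor([t if keep else -100 for t, keep in zip(tokens, mask)])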
