FIM tokens

#4
by msf0 - opened

Hey Bartowski, thanks so much for your GGUFs. I am happily using many of them.

I'm wondering if this GGUF is missing the FIM tokens. Can this be set with gguf_set_metadata.py? I'm wondering if you'd know the arguments to run it with, so I can apply it to the copy I already have saved.

Or rather gguf_new_metadata.py

Oh sorry I missed this!

I don't think it's missing them but I'll double check

The tokens are present, but tokenizer.ggml.fim_{pre,suf,mid}_token_id is not set in the metadata, required for the llama.cpp /infill endpoint. It's probably no help to you, @msf0 , but you can construct a FIM prompt and use the normal llama.cpp completion endpoint to produce proper FIM completions. That's how I've been using this model for FIM.

[SUFFIX]{suffix}[PREFIX]{prefix}

Note the lack of [MIDDLE] token at the end. If you add one it doesn't respond properly. I suspect this was a typo in Codestral's framework and it was mistakenly trained without a FIM MID token. That's really unfortunate, because even if FIM tokens are added to the metadata it still won't work with /infill. For example, theoretically this should set the metadata for your local copy:

gguf_new_metadata.py --special-token fim_suf '[SUFFIX]' --special-token fim_pre '[PREFIX]' --special-token fim_mid '[MIDDLE]' input.gguf output.gguf

Then use --spm-infill when running the server since this model's FIM wants suffix first. However, llama.cpp will put the MID token at the end, and so FIM won't work correctly. As far as I can tell there's no way to stop llama.cpp from appending the MID token short of modifying the llama.cpp sources. You can't use a -1 "null" token for MID because then /infill refuses to run.
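The manual approach described above (build the FIM prompt yourself and send it to the normal completion endpoint) could be sketched roughly as follows. This is a minimal illustration, not the exact code used in this thread: the helper names, the server address, and the n_predict value are assumptions, and it presumes a llama.cpp server is already running with the Codestral GGUF loaded.

```python
import json
import urllib.request


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Suffix-first FIM prompt for Codestral, deliberately without a trailing [MIDDLE]."""
    return f"[SUFFIX]{suffix}[PREFIX]{prefix}"


def complete(prompt: str,
             url: str = "http://127.0.0.1:8080/completion",  # assumed local server address
             n_predict: int = 64) -> str:
    """POST the prompt to the llama.cpp server's completion endpoint and return the text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]


# Text before and after the cursor in a hypothetical source file.
prefix = "def add(a, b):\n    return "
suffix = "\n\nprint(add(1, 2))\n"
prompt = build_fim_prompt(prefix, suffix)

# Requires a running server, so it is left commented out here:
# print(complete(prompt))
```

The only model-specific part is build_fim_prompt, so the same client can serve other models by swapping in their FIM template.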

Thank you so much for the detailed response, @wellons !

I've been using the Qwen 2.5 Coder LLMs for autocompletion with ggerganov's llama.vim, and a copy of the Codestral here just by chat through llama.cpp's web interface. I got a bit interested again in trying to get Codestral to work for infill after seeing how it scores in lmarena.ai's Copilot Arena.

I don't think I could figure out how to change the vimscript to replicate your first suggestion, but if it's nothing to you, do you think the llama.cpp modification could be as little as a one-line change? I'd be willing to keep a separate build just for this even if it's hacky. Otherwise, please don't bother. Thank you for your response

I'm not positive that remaking this quant would help. I'm happy to try to get the infill tokens, but I'll need to figure out how to get them properly.

@bartowski , please don't bother just for me. I was just wondering if there was a quick fix. And maybe we'll see a new Codestral from Mistral soon anyways. Thank you again!

msf0 changed discussion status to closed

I was going to suggest simply using DeepSeek Coder, Qwen Coder, or Granite Code instead. Codestral FIM is mediocre, and as this shows v0.1 wasn't even trained properly. The FIM outputs from the other three models are markedly better. IMHO, DeepSeek has the best FIM training of all, and it's my primary recommendation. The main problem with Granite is the larger models only sport a tiny context.

However, I thought it would be a neat challenge to get /infill working with Codestral anyway, especially since I expected it could be done by deleting one line of code. Turns out --spm-infill has been broken since October 24th when it started appending a BOS token to the FIM prompt. Nobody seems to have noticed. (It's times like this I feel like I'm the only person on the planet using the FIM training for anything!) You'll still need the --spm-infill option. So it ended up requiring two changes:

https://github.com/skeeto/llama.cpp/tree/codestral-fim-hacks

You'll still need to modify the metadata in addition to these llama.cpp changes, after which /infill works for me. Like I said before, I skip /infill and construct the FIM prompts myself, which lets me fix FIM for other models, too, without modifying llama.cpp. So I don't intend to maintain these changes.
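To make the effect of those two changes concrete, here is a toy model of how the infill token sequence might be assembled. It is not llama.cpp's actual code: the token IDs are hypothetical placeholders, the function and flag names are invented for illustration, and placing BOS at the front of the sequence is an assumption.

```python
# Hypothetical token IDs, for illustration only; real IDs come from the GGUF vocabulary.
TOK = {"bos": 1, "fim_pre": 101, "fim_suf": 102, "fim_mid": 103}


def assemble_infill(prefix_ids, suffix_ids, spm=True, add_bos=True, add_mid=True):
    """Toy model of infill prompt assembly (not llama.cpp's real implementation)."""
    pre = [TOK["fim_pre"]] + prefix_ids
    suf = [TOK["fim_suf"]] + suffix_ids
    # Suffix-first ordering stands in for --spm-infill; otherwise prefix comes first.
    out = suf + pre if spm else pre + suf
    if add_bos:
        out = [TOK["bos"]] + out      # assumed placement of the unwanted BOS token
    if add_mid:
        out = out + [TOK["fim_mid"]]  # the trailing MID token Codestral was not trained with
    return out


prefix_ids = [11, 12]  # stand-ins for the tokenized text before the cursor
suffix_ids = [21, 22]  # stand-ins for the tokenized text after the cursor

stock = assemble_infill(prefix_ids, suffix_ids)
patched = assemble_infill(prefix_ids, suffix_ids, add_bos=False, add_mid=False)
print(stock)    # [1, 102, 21, 22, 101, 11, 12, 103]
print(patched)  # [102, 21, 22, 101, 11, 12]
```

The patched sequence corresponds to the [SUFFIX]{suffix}[PREFIX]{prefix} layout that works with the plain completion endpoint.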

Thank you again for the response!

I tried out the code with llama.vim, after first modifying the GGUF. The first autocompletion is shown and can be accepted, but on the second I get:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 32768)
Aborted (core dumped)

I will continue using Qwen Coder for now. Thank you for taking the time to show me on GitHub!

For the sake of curiosity, I want to understand that error. My changes were simply eliminating two std::vector::push_back calls, to prevent two unwanted tokens, and couldn't have produced, on their own, that range check failure (an overflow subtracting 1 from 0, i.e. it assumed a non-empty vector when it was empty). I suspect it's revealed an unrelated llama.cpp bug.
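The number in the abort message matches that reading. As a quick check of the arithmetic, assuming a 64-bit unsigned size type (the helper name here is made up for the illustration):

```python
SIZE_T_BITS = 64  # assumption: size_t is 64 bits wide on the reporting machine


def size_t_sub(a: int, b: int) -> int:
    """Subtract with unsigned wraparound, the way a 64-bit size_t behaves."""
    return (a - b) % (1 << SIZE_T_BITS)


empty_size = 0                          # size() of an empty vector
last_index = size_t_sub(empty_size, 1)  # "size() - 1" on that empty vector
print(last_index)                       # 18446744073709551615

# A bounds-checked lookup rejects the index against a container of size 32768.
print(last_index >= 32768)              # True
```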

I originally tested with Clang and Metal inference. Your abort indicates libstdc++ (Linux?) std::vector::at, so I compiled a debug build with GCC, and -D_GLIBCXX_DEBUG for good measure, CPU inference, and I still cannot reproduce the error (with Q8_0). I lack the VRAM to test CUDA or Vulkan inference with Codestral. For completeness, I also compiled and ran an MSVC debug build, still no errors. It's possible the error might be limited to certain quants, and so I could never reproduce with Q8_0.

If you could tell me: (1) what Codestral quant, and (2) what inferencing (CPU, etc.) produced that error, it might enlighten me. A backtrace would be informative, too, but if you can't produce one trivially don't worry about it.

I am using:
https://huggingface.co./bartowski/Codestral-22B-v0.1-GGUF/blob/main/Codestral-22B-v0.1-Q4_K_M.gguf

with the Vulkan backend on Linux, with an AMD graphics card.

With commit:
https://github.com/ggerganov/llama.cpp/commit/716bd6dec3e044e5c325386b5b0483392b24cefe

Before you gave me your gguf_new_metadata.py command and your llama.cpp modifications, I had already tried setting those special tokens and attempting some autocompletions with llama.vim, and received the identical output and failure. That is, with or without the llama.cpp modifications you showed me, I get the same result. (I should've mentioned this earlier.)

I don't know how to produce a backtrace with a C++ program, but I can run a given command if you'd like

That's all the detail I needed. Thanks for the followup info!

Happy to help, and happy New Year!

I don't know if it makes a difference, but actually it crashes the same way sometimes after the very first request. (With llama.vim, I have the auto autocompletions set to off, and press the keyboard shortcut to request one)
