something is wrong with this model
Something is wrong with this model. I'm getting great outputs from the 7b model but not this one, even though I'm using the same script. Please check the tokenizer or other configuration files... I'm not sure what it is.
Hey @ctranslate2-4you, can you elaborate more on this?
Sure. When I run it with the exact same script as the 7b version, it says it can't find the answer to a question. I'm posing a RAG-type question in a single question-and-answer script to test my RAG application, with no change to the parameters, inference logic, or anything else. That said, I am using the bitsandbytes library to do 4-bit quantization, which is the only thing I can think of that might make a difference, but it's strange that it would only affect the 13b model. Here is the prompt format I'm using:
prompt = f"""<|endoftext|><|user|>
{user_message}
<|assistant|>
"""
Notice that I'm not using the annoying apply_chat_template, simply because I like seeing the formatting myself.
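That said, here's a quick sanity check I could run. This is just a sketch, assuming the 13b tokenizer actually ships a chat template; the model path and question are placeholders, not my real script. It compares my hand-written prompt against whatever apply_chat_template would produce:

# Sketch: compare the hand-written prompt with the tokenizer's own chat template.
# "model_id" and "user_message" below are placeholders.
from transformers import AutoTokenizer

model_id = "path-or-repo-of-the-13b-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

user_message = "What does the retrieved context say about the topic?"  # placeholder question

messages = [{"role": "user", "content": user_message}]
templated = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant turn marker
)

manual = f"<|endoftext|><|user|>\n{user_message}\n<|assistant|>\n"

print(repr(templated))
print(repr(manual))

If the two strings differ (an extra <|endoftext|>, a missing newline, different role tags), that alone could explain why only the 13b model is misbehaving.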
Anyway, here's the configuration information as well. As you can see, I've tried commenting/uncommenting double quant (bnb_4bit_use_double_quant) and flash attention 2, with the same result:
import torch
from transformers import BitsAndBytesConfig

# Shared kwargs for loading the tokenizer and the 4-bit quantized model
bnb_bfloat16_settings = {
    'tokenizer_settings': {
        'torch_dtype': torch.bfloat16,
        'trust_remote_code': True,
    },
    'model_settings': {
        'torch_dtype': torch.bfloat16,
        'quantization_config': BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
            # bnb_4bit_use_double_quant=True,
        ),
        'low_cpu_mem_usage': True,
        'trust_remote_code': True,
        'attn_implementation': "sdpa",
        # 'attn_implementation': "flash_attention_2",
    }
}
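In case it matters, this is roughly how those settings get applied when loading everything. It's a simplified sketch, not my actual script, and the model path, max_new_tokens, and the decode step are placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/the-13b-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(
    model_id, **bnb_bfloat16_settings['tokenizer_settings']
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, **bnb_bfloat16_settings['model_settings']
)

# "prompt" is built with the format shown above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(
    output_ids[0][inputs['input_ids'].shape[1]:],  # strip the prompt tokens
    skip_special_tokens=True,
))

Again, the exact same loading code works fine for the 7b model, which is why I suspect the tokenizer or configuration files for this one.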