something is wrong with this model
Something is wrong with this model. I'm getting great outputs from the 7b model but not this one, even though I'm using the same script. Please check the tokenizer or other configuration files... I'm not sure what it is.
Hey @ctranslate2-4you, can you elaborate more on this?
Sure. When I run it with the exact same script as the 7b version, it says it can't find the answer to a question. I'm posing a RAG-type question in a single question-and-answer script to test my RAG application, with no change to the parameters, inference logic, or anything else. That said, I am using the bitsandbytes library to do 4-bit quantization, which is the only thing I can think of that might make a difference, but it's strange that it would only affect the 13b model. Here is the prompt format I'm using:
prompt = f"""<|endoftext|><|user|>
{user_message}
<|assistant|>
"""
Notice that I'm not using the annoying apply_chat_template, simply because I like seeing the formatting myself.
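That said, here's a quick sanity check I could run. This is just a sketch, assuming the 13b tokenizer actually ships a chat template; the model path and question are placeholders, not my real script. It compares my hand-written prompt against whatever apply_chat_template would produce:

# Sketch: compare the hand-written prompt with the tokenizer's own chat template.
# "model_id" and "user_message" below are placeholders.
from transformers import AutoTokenizer

model_id = "path-or-repo-of-the-13b-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

user_message = "What does the retrieved context say about the topic?"  # placeholder question

messages = [{"role": "user", "content": user_message}]
templated = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant turn marker
)

manual = f"<|endoftext|><|user|>\n{user_message}\n<|assistant|>\n"

print(repr(templated))
print(repr(manual))

If the two strings differ (an extra <|endoftext|>, a missing newline, different role tags), that alone could explain why only the 13b model is misbehaving.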
Anyway, here's the configuration information as well. As you can see, I've tried commenting/uncommenting double quant (bnb_4bit_use_double_quant) and flash attention 2, with the same result:
import torch
from transformers import BitsAndBytesConfig

# Shared kwargs for loading the tokenizer and the 4-bit quantized model
bnb_bfloat16_settings = {
    'tokenizer_settings': {
        'torch_dtype': torch.bfloat16,
        'trust_remote_code': True,
    },
    'model_settings': {
        'torch_dtype': torch.bfloat16,
        'quantization_config': BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
            # bnb_4bit_use_double_quant=True,
        ),
        'low_cpu_mem_usage': True,
        'trust_remote_code': True,
        'attn_implementation': "sdpa",
        # 'attn_implementation': "flash_attention_2",
    }
}
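In case it matters, this is roughly how those settings get applied when loading everything. It's a simplified sketch, not my actual script, and the model path, max_new_tokens, and the decode step are placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/the-13b-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(
    model_id, **bnb_bfloat16_settings['tokenizer_settings']
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, **bnb_bfloat16_settings['model_settings']
)

# "prompt" is built with the format shown above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(
    output_ids[0][inputs['input_ids'].shape[1]:],  # strip the prompt tokens
    skip_special_tokens=True,
))

Again, the exact same loading code works fine for the 7b model, which is why I suspect the tokenizer or configuration files for this one.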