Taking way too long to generate a response

#46 opened by Idkkitsune

I modified the code because of this warning:
UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co./docs/transformers/generation_strategies#default-text-generation-configuration )

but the same thing happens with the original, unmodified version too.
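(For reference, the docs linked in that warning recommend controlling generation through a GenerationConfig or through call-time arguments rather than the model config. A minimal sketch of the call-time variant, using the same pipe and prompt as in the code below, would be:)

# Sketch: pass the generation parameters directly to the pipeline call
# instead of modifying the pretrained model configuration.
outputs = pipe(prompt, max_new_tokens=2, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)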

The Code:

import torch
from transformers import pipeline, GenerationConfig

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")

# Create a GenerationConfig instance with your desired settings
generation_config = GenerationConfig(
    max_new_tokens=2, do_sample=True, temperature=0.7, top_k=50, top_p=0.95
)

# Use the tokenizer's chat template to format each message
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "Say Hi!"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print("Done")
outputs = pipe(prompt, generation_config=generation_config)
print(outputs[0]["generated_text"])

For some reason it gets stuck at this specific line, "outputs = pipe(prompt, generation_config=generation_config)", and only produces the response after 30 minutes or so.

If I press Ctrl+C, I get this traceback:

  File "c:/Users/Home/Desktop/testingenv/something.py", line 21, in <module>
    outputs = pipe(prompt, generation_config=generation_config)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\text_generation.py", line 208, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\base.py", line 1140, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\base.py", line 1147, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\base.py", line 1046, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\text_generation.py", line 271, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\generation\utils.py", line 1777, in generate
    return self.sample(
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\generation\utils.py", line 2874, in sample
    outputs = self(
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\models\mistral\modeling_mistral.py", line 1154, in forward
    outputs = self.model(
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\models\mistral\modeling_mistral.py", line 1039, in forward
    layer_outputs = decoder_layer(
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\models\mistral\modeling_mistral.py", line 754, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\models\mistral\modeling_mistral.py", line 652, in forward
    value_states = self.v_proj(hidden_states)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)

Is there any way to solve this? I'm quite new to this. To be clear, it doesn't give an error; it just takes a very long time to run. Thank you.

All I can say is that I am facing a similar situation on my mobile RTX 4090 too. Hope there are some insights on how to mitigate this issue. I'm not sure if there is any way to quantize it; bitsandbytes and Flash Attention are not working for me either.
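In case it helps, here is a minimal sketch of loading the model quantized to 4-bit via bitsandbytes (this assumes a recent transformers and bitsandbytes install and a CUDA GPU; I haven't verified it on a mobile RTX 4090):

import torch
from transformers import pipeline, BitsAndBytesConfig

# Quantize the weights to 4-bit at load time to cut VRAM usage;
# compute still runs in float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    model_kwargs={"quantization_config": bnb_config},
    device_map="auto",
)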

I guess that's an issue with the bfloat16 data type. I'm using Google Colab, and the V100 GPU gets stuck at model.generate() if AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.bfloat16) is set, while the L4 (which has the Ada Lovelace architecture) runs without problems under the same configuration.

Support for bfloat16 depends on the GPU architecture (and perhaps the CUDA/cuDNN version; clarification welcome).
According to this table, you need at least an Ampere GPU (RTX 30xx / A100) to run inference with bfloat16.
(I have no idea why @n094t23g's RTX 4090 refuses to run inference, though!)

PyTorch offers a way to check if your device supports bfloat16:

import torch
torch.cuda.is_bf16_supported()

If it returns False, you can set torch_dtype=torch.float16 instead of bfloat16 and it won't get stuck.
(I observed a slight change in the generated text between float16 and bfloat16, but I don't know how much it affects the model quality.)
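In case it's useful, here is a minimal sketch of picking the dtype based on that check (assuming the same zephyr-7b-beta pipeline setup as above):

import torch
from transformers import pipeline

# Assumption: use bfloat16 when the GPU supports it natively, otherwise fall back to float16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=dtype,
    device_map="auto",
)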
