Taking way too long to generate a response
I modified the code because of this warning:
UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co./docs/transformers/generation_strategies#default-text-generation-configuration )
but this happens with the original version too.
The Code:
import torch
from transformers import pipeline, GenerationConfig

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")

# Create a GenerationConfig instance with the desired settings
generation_config = GenerationConfig(
    max_new_tokens=2, do_sample=True, temperature=0.7, top_k=50, top_p=0.95
)

# Use the tokenizer's chat template to format each message
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "Say Hi!"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print("Done")

outputs = pipe(prompt, generation_config=generation_config)
print(outputs[0]["generated_text"])
For some reason it gets stuck at this specific line, outputs = pipe(prompt, generation_config=generation_config), and only produces the response after 30 minutes or so.
If I press Ctrl+C, I get this traceback:
File "c:/Users/Home/Desktop/testingenv/something.py", line 21, in <module>
outputs = pipe(prompt, generation_config=generation_config)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\text_generation.py", line 208, in __call__
return super().__call__(text_inputs, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\base.py", line 1140, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\base.py", line 1147, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\base.py", line 1046, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\pipelines\text_generation.py", line 271, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\generation\utils.py", line 1777, in generate
return self.sample(
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\generation\utils.py", line 2874, in sample
outputs = self(
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\models\mistral\modeling_mistral.py", line 1154, in forward
outputs = self.model(
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\models\mistral\modeling_mistral.py", line 1039, in forward
layer_outputs = decoder_layer(
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\models\mistral\modeling_mistral.py", line 754, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\transformers\models\mistral\modeling_mistral.py", line 652, in forward
value_states = self.v_proj(hidden_states)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "c:\Users\Home\Desktop\testingenv\myenv\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
Is there any way to solve this? I'm quite new to this. It doesn't give an error; it just takes a very long time to run. Thank you.
All I can say is that I am facing a similar situation on my mobile RTX 4090 too. I hope there are some insights on how to mitigate this issue. I'm not sure if there is any way to quantize it; bitsandbytes and flash attention are not working for me either.
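In case it helps, a typical 4-bit bitsandbytes load would look roughly like this (just a sketch of the standard route, not something I have verified with this model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Standard 4-bit quantization config; bnb_4bit_compute_dtype keeps the matmuls in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    quantization_config=bnb_config,
    device_map="auto",
)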
I guess that's an issue with the bfloat16 datatype. I'm using Google Colab, and the V100 GPU gets stuck at model.generate() if AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.bfloat16) is set, while the L4 (which has the Ada Lovelace architecture) runs without problems under the same configuration.
The support for bfloat16 depends on the GPU architecture (and perhaps the CUDA/cuDNN version; clarification welcome). According to this table you need an Ampere GPU (RTX 30xx / A100) or newer to run inference with bfloat16.
(I have no idea why @n094t23g's RTX 4090 refuses to run inference, though!)
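To see which architecture you're on, you can also check the CUDA compute capability (a quick sketch; Ampere and newer report a major version of 8 or above):

import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
# Ampere (RTX 30xx / A100) and newer report major >= 8 and have native bfloat16 support
print("bfloat16-capable architecture:", major >= 8)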
PyTorch offers a way to check if your device supports bfloat16:
import torch
torch.cuda.is_bf16_supported()
And in case it returns False, you can set torch_dtype=torch.float16 instead of bfloat16 and it won't get stuck.
(I observed a slight change in the generated text between float16 and bfloat16, but I don't know how much it affects the model quality...)
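Putting that together, a minimal sketch of the dtype fallback (using the same model as in the original post):

import torch
from transformers import pipeline

# Use bfloat16 only when the GPU supports it; otherwise fall back to float16
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=dtype,
    device_map="auto",
)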