use Flash Attention
#8
opened by kakascode
I attempted to use Flash Attention but encountered the following error: "NewModel does not support Flash Attention 2.0 yet." Does the model gte-multilingual-base not support Flash Attention 2.0 yet?
Could you please paste the code for your model inference here? It would help us with debugging.
from transformers import AutoModel

model = AutoModel.from_pretrained(model_path, trust_remote_code=True, attn_implementation="flash_attention_2")
ValueError: NewModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted
xformers includes a Flash Attention 2 kernel and will dispatch to it when the device and data type are appropriate; see https://huggingface.co./Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers
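A minimal sketch of that recommendation, assuming the unpad_inputs and use_memory_efficient_attention options described in the linked model card (these are handled by the model's remote code rather than by transformers itself) and that xformers is installed:

```python
# Sketch: enable unpadding and xformers memory-efficient attention
# for gte-multilingual-base, per the linked recommendation.
# Assumes the remote code accepts unpad_inputs / use_memory_efficient_attention
# and that xformers is installed (pip install xformers).
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "Alibaba-NLP/gte-multilingual-base"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    unpad_inputs=True,                    # skip padding tokens inside the model
    use_memory_efficient_attention=True,  # dispatch to the xformers attention kernel
    torch_dtype=torch.float16,            # half precision so the fast kernel can be selected
).to("cuda")
model.eval()

texts = ["what is the capital of China?", "how to implement quick sort in python?"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**batch)

# CLS embeddings, normalized for cosine similarity
embeddings = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], dim=-1)
print(embeddings.shape)
```

With this path, passing attn_implementation="flash_attention_2" should not be necessary; the acceleration is handled inside the remote modeling code, which may fall back to standard attention on CPU or in full precision.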