use Flash Attention

#8
by kakascode - opened

I attempted to use Flash Attention, but encountered the following error: NewModel does not support Flash Attention 2.0 yet. Does the gte-multilingual-base model not support Flash Attention 2.0 yet?

Alibaba-NLP org

Could you please paste the code for your model inference here? It would help us with debugging.

from transformers import AutoModel

model = AutoModel.from_pretrained(model_path, trust_remote_code=True, attn_implementation="flash_attention_2")

ValueError: NewModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted
Alibaba-NLP org

xformers has a Flash Attention 2 kernel and will dispatch to it when running on an appropriate device and data type. See https://huggingface.co./Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers for details.
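
A minimal sketch of loading the model along the xformers path, assuming the unpad_inputs and use_memory_efficient_attention flags described in the linked page, that xformers is installed, and that a CUDA GPU with fp16 support is available (verify the exact flag names against the model card):

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "Alibaba-NLP/gte-multilingual-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Flags below follow the linked "new-impl" recommendation (assumed names):
# they enable input unpadding and memory-efficient attention via xformers,
# which dispatches to a Flash Attention 2 kernel on suitable hardware/dtype.
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.float16,
).to("cuda")

inputs = tokenizer(
    ["what is the capital of China?"],
    padding=True, truncation=True, return_tensors="pt",
).to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

# Use the first token's hidden state as the sentence embedding.
embeddings = outputs.last_hidden_state[:, 0]

With this setup there is no need to pass attn_implementation="flash_attention_2"; the acceleration comes from xformers rather than from the Transformers Flash Attention 2 integration.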
