Problems with flash-attention2

#13
by omaer0 - opened

According to the model card

If you want faster inference using flash-attention2, you need to install these dependencies:

pip install packaging ninja
pip install flash-attn==v2.1.1 --no-build-isolation
pip install git+https://github.com/HazyResearch/[email protected]#subdirectory=csrc/rotary

However, flash-attention2 seems to be mandatory, because modeling_flash_llama.py contains

try:
    from flash_attn.flash_attn_interface import (
        flash_attn_kvpacked_func,
        ...
    )
except ImportError:
    flash_attn_v2_installed = False
    raise ImportError('Please install Flash Attention: `pip install flash-attn --no-build-isolation`')

Question 1) Is it possible to use leo-hessianai-13b-chat (and 7b-chat) without flash-attention2? How?

When I run the above pip install commands with a recent torch version (2.2) and CUDA version (cu121),
there is a library mismatch with flash-attn==v2.1.1: an undefined-symbol error

import torch; from flash_attn import flash_attn_func
Traceback (most recent call last):
   import flash_attn_2_cuda as flash_attn_cuda 
site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: at::_ops::_pad_enum::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, long, c10::optional<double>)

(the error output was passed through c++filt)
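For reference, this is roughly how I check which versions a prebuilt flash-attn wheel has to match (nothing here is specific to leo-hessianai; torch.compiled_with_cxx11_abi() reports the ABI flag discussed in the issue linked below):

import torch

# Versions and ABI flag that the flash-attn build must match
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cxx11 abi:", torch.compiled_with_cxx11_abi())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as e:
    # undefined-symbol errors from flash_attn_2_cuda also surface as ImportError
    print("flash-attn not importable:", e)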

Question 2) With what PyTorch/CUDA versions was leo-hessianai trained? Which versions of flash-attn2 can be used? Are you using cxx11abiFALSE or cxx11abiTRUE (see https://github.com/Dao-AILab/flash-attention/issues/457 )?

I tried the latest flash-attn==2.5.1.post1 and a couple of earlier PyTorch/CUDA versions, but without success. A similar issue is mentioned in https://github.com/Dao-AILab/flash-attention/issues/836, but I am not running Docker; another is mentioned in https://github.com/Dao-AILab/flash-attention/issues/667#issuecomment-1816039443, but I did not succeed with the versions listed there.

LAION LeoLM org

By now you just need to use trust_remote_code=False, or leave the argument out entirely. This model is fully compatible with the flash attention implementation in transformers.
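For example, something along these lines should work with a recent transformers release (the repo id and the attn_implementation argument are only illustrative; leaving attn_implementation unset falls back to the default attention, so flash-attn is not required):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LeoLM/leo-hessianai-13b-chat"  # same pattern for the 7b-chat model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires accelerate
    # attn_implementation="flash_attention_2",  # optional; only if flash-attn is installed
)

# quick sanity check that generation runs without trust_remote_code
inputs = tokenizer("Hallo, wie geht es dir?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))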
