Improving Hugging Face Training Efficiency Through Packing with Flash Attention
β’
25
thanks
Thanks a lot @julien-c, means a lot coming from you :)
@joaogante I am adding a new architecture for this: https://github.com/huggingface/transformers/pull/29578
It supports both padding-free and normal (padded) operation.
Yeah, it's just that people have not been using this for finetuning, where it can give considerable memory savings. I suspect the issue is the core design of HF transformers.
I am planning to release the code for this sometime soon :)
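For anyone looking for a starting point in the meantime, here is a minimal sketch of padding-free packed finetuning with `DataCollatorWithFlattening` (available in recent `transformers` releases). The checkpoint and the toy texts are placeholders, and it assumes a CUDA GPU with `flash-attn` installed:

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint: any causal LM with Flash Attention 2 support works.
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # padding-free packing relies on FA2
)

# Toy training texts. Tokenize WITHOUT padding: the collator
# concatenates the raw, unpadded token lists itself.
texts = [
    "Packing removes the pad tokens entirely.",
    "Flash Attention keeps packed examples from attending to each other.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2, bf16=True),
    train_dataset=dataset,
    # Flattens each batch into one packed sequence and emits position_ids
    # that restart at every example boundary; FA2 uses those boundaries to
    # build cu_seqlens, so there are no pad tokens and no cross-example
    # attention contamination.
    data_collator=DataCollatorWithFlattening(),
)
trainer.train()
```

The memory saving comes from the collator emitting one packed sequence per batch instead of a rectangular padded tensor, so no compute or activation memory is spent on pad tokens.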