Inference speed
#2 by rmihaylov - opened
Great work! How fast is the inference? They say QLora is slow at inference, is it true?
Yeah, pretty slow - it took about 2 minutes to generate 200 tokens with the 40b model. The 7b variant is faster.
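For reference, that timing comes from the standard transformers + bitsandbytes 4-bit loading path. A minimal sketch of that setup (the tiiuae/falcon-40b model id and the prompt here are placeholders, not the exact code from my run):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b"  # placeholder: swap in your finetuned checkpoint

# NF4 4-bit quantization config (the same scheme QLoRA uses)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modeling code at the time
)

inputs = tokenizer("Falcon is a large language model that", return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(f"{time.time() - start:.1f}s for 200 new tokens")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```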
I just shared this finetuning repo: https://github.com/rmihaylov/falcontune - it gets 5-7x faster inference in 4-bit compared to QLoRA 4-bit.
I also run inference in 4-bit here.
Yes, I know, but the forward computation at inference is not using CUDA/Triton kernels; that is why it is slow.
Is there a low-code way to move the inference to Triton kernels? FWIW, the inference is happening on the CUDA device in my code.
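For example, a quick device check along these lines, continuing from the snippet above where `model` is the 4-bit Falcon model loaded with device_map="auto":

```python
# Sanity check that the 4-bit weights actually sit on the GPU
# (`model` is assumed to be the AutoModelForCausalLM loaded above).
import torch

print(torch.cuda.is_available())              # True if a CUDA device is visible
print(next(model.parameters()).device)        # e.g. cuda:0
print(getattr(model, "hf_device_map", None))  # per-module placement when device_map is used
```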
Not implemented yet in bitsandbytes
dfurman changed discussion status to closed