Slow response time for 7B and 40B
Running Falcon-7B on an A100 GPU, the response time is around 10-12 s, and up to 40 s for longer answers.
How can we increase the output speed? Falcon-40B takes around 4-5 minutes for a short answer.
Also, what are the fine-tuning requirements for the 7B and 40B?
falcon_print_timings: load time = 1950.86 ms
falcon_print_timings: sample time = 20.62 ms / 90 runs ( 0.23 ms per token, 4365.33 tokens per second)
falcon_print_timings: batch eval time = 1210.28 ms / 409 tokens ( 2.96 ms per token, 337.94 tokens per second)
falcon_print_timings: eval time = 1881.62 ms / 89 runs ( 21.14 ms per token, 47.30 tokens per second)
falcon_print_timings: total time = 3142.62 ms
As a 7B reference: that's on a 4090 (closer to an H100), but the speed is about the same on a 3090 (roughly A100 speed).
A 409-token prompt and a 90-token response take around 3.1 seconds. That's at ~6-bit quantization; it's about 20% slower at full precision.
Maybe you have a problem on the CPU side: swapping while loading, or loading from an HDD, etc.?
Here is a 40B reference:
falcon_print_timings: load time = 5666.06 ms
falcon_print_timings: sample time = 13.84 ms / 61 runs ( 0.23 ms per token, 4408.47 tokens per second)
falcon_print_timings: batch eval time = 4116.55 ms / 409 tokens ( 10.06 ms per token, 99.36 tokens per second)
falcon_print_timings: eval time = 3561.89 ms / 60 runs ( 59.36 ms per token, 16.85 tokens per second)
falcon_print_timings: total time = 7720.06 ms
For smaller token counts the average speed for the 7B is around 3-4 seconds now, running on 2 A100 GPUs, but the 40B is still really slow (and fails to answer properly). I checked the GPU and CPU usage and it seems fine.
Running it on a 24-vCPU VM.
I'm not sure what you mean by average speed; you'd need to measure it the same way as the numbers I quoted, i.e. tokens/second for generation and tokens/second for prompt processing (see the sketch below).
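For getting comparable numbers from the Python side, here is a minimal sketch (not the ggllm.cpp timer; the model id, prompt, and token counts are placeholders you'd replace with your own):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # placeholder: use whichever checkpoint you benchmark
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

prompt = "Your ~400-token prompt here."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
n_prompt = inputs["input_ids"].shape[1]

# Prompt processing (prefill): one forward pass over the whole prompt.
torch.cuda.synchronize()
t0 = time.perf_counter()
with torch.inference_mode():
    model(**inputs)
torch.cuda.synchronize()
prefill_s = time.perf_counter() - t0

# Generation: produce new tokens on top of the prompt.
t0 = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=90, do_sample=False)
torch.cuda.synchronize()
gen_s = time.perf_counter() - t0
n_new = out.shape[1] - n_prompt

# generate() repeats the prefill, so subtract it for a rough decode-only rate.
decode_s = max(gen_s - prefill_s, 1e-9)
print(f"prompt: {n_prompt / prefill_s:.1f} tok/s, generation: {n_new / decode_s:.1f} tok/s")
```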
Try the ggllm.cpp project if you don't get proper speed using Python; it's quite simple to get running and has far more flexibility in terms of configuration and quantization.
2x A100 would be quite overkill for it; you don't need that much VRAM. There is no benefit I know of to running inference at 16-bit precision: you get the same responses at ~6-bit quantization, which is a fraction of the size.
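If you stay in the Python/transformers stack rather than ggllm.cpp, one analogous way to avoid 16-bit on 2x A100 is to load the model quantized on a single GPU via bitsandbytes. This is just a sketch of that option (4-bit here, since bitsandbytes doesn't offer the 6-bit GGML format the numbers above refer to), not what was used for the quoted timings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b"  # placeholder
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # fits 40B in a fraction of the 16-bit footprint
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```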
There is an issue with past_key_values not being used in the 7B and 40B model code, as mentioned here: https://huggingface.co./tiiuae/falcon-40b/discussions/48. The issue is fixed in this transformers PR: https://github.com/huggingface/transformers/pull/24523. You can try using that.
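A minimal sketch of how you might pick that up, assuming you install a transformers build that includes the fix (e.g. from source once the PR is merged) and let generate() use the KV cache; the model id and prompt are placeholders:

```python
# Without past_key_values, every new token re-processes the whole sequence,
# which is what makes generation crawl. A fixed build caches the keys/values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"  # or tiiuae/falcon-7b
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # may be droppable once the native transformers implementation is used
)

inputs = tokenizer("Explain KV caching in one paragraph.", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, use_cache=True)  # use_cache enables past_key_values
print(tokenizer.decode(out[0], skip_special_tokens=True))
```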
Thanks