How to speed up inference?

#4 opened by vegasscientific

I tried this on 2x A6000 48 GB and it takes around 35s for a test image. On an H100 80 GB it still takes 25s per image. Is there a vLLM configuration or other example that gives faster inference? The Qwen API server is much faster - what configuration do they use?

In the example code they have:

You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.

from transformers import AutoProcessor

# Each visual token covers a 28x28 pixel patch, so this is a budget of 256-1280 tokens per image.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels)
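If you only want to shrink some inputs, you can also cap the budget per image in the message itself instead of globally on the processor. A minimal sketch, assuming qwen_vl_utils is installed and that your version accepts the per-image min_pixels/max_pixels keys described on the model card; the file path and the 1024-token cap are placeholders, and it reuses the processor created above:

from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        # Per-image override of the visual token budget (placeholder values).
        {"type": "image", "image": "file:///path/to/test.jpg",
         "min_pixels": 256 * 28 * 28, "max_pixels": 1024 * 28 * 28},
        {"type": "text", "text": "Describe this image."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")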

Setting a smaller image size speeds it up, since the model processes fewer visual tokens per image.
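On the vLLM question: here is a minimal sketch of what I'd try with the offline LLM API, combining tensor parallelism across the two A6000s with the same capped visual token budget. Argument names like mm_processor_kwargs and limit_mm_per_prompt depend on your vLLM version, the chat-format prompt and file path are placeholders, and I don't know what configuration the hosted Qwen API actually uses, so treat this as a starting point rather than a known-good config.

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=2,            # split the 72B weights across both GPUs
    gpu_memory_utilization=0.90,
    limit_mm_per_prompt={"image": 1},  # one image per prompt
    # Same idea as min_pixels/max_pixels above: cap the visual token budget.
    mm_processor_kwargs={"min_pixels": 256 * 28 * 28, "max_pixels": 1280 * 28 * 28},
)

# Qwen2.5-VL chat format with a single image placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("/path/to/test.jpg")}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)

Passing a list of such requests to a single llm.generate() call should help further, since vLLM batches them together instead of running images one at a time.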
