How to speed up inference?
#4 - opened by vegasscientific
I tried this on 2x A6000 48GB and it takes around 35 s for a test image. I put it on an H100 80GB and it still takes 25 s per image. Is there a vLLM configuration or another example to get faster inference? The Qwen API server is much faster - what configuration do they use?
The example code has this note:
You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
from transformers import AutoProcessor

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels)
Setting a smaller image size (lower max_pixels) speeds it up.
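For vLLM specifically, the same pixel budget can be passed through mm_processor_kwargs when constructing the engine. This is a minimal sketch, assuming a recent vLLM build with Qwen2.5-VL support; tensor_parallel_size, max_model_len, the pixel values, and the image path are illustrative placeholders, not the settings the Qwen API uses:

from PIL import Image
from vllm import LLM, SamplingParams

# Illustrative settings; set tensor_parallel_size to your GPU count.
llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    tensor_parallel_size=2,            # e.g. 2x A6000; use 1 on a single H100
    max_model_len=8192,                # cap context / KV cache to what you need
    limit_mm_per_prompt={"image": 1},  # one image per request
    mm_processor_kwargs={
        "min_pixels": 256 * 28 * 28,   # smaller image-token budget -> faster prefill
        "max_pixels": 1280 * 28 * 28,
    },
)

sampling = SamplingParams(temperature=0.0, max_tokens=512)

# Qwen2.5-VL chat format with an image placeholder; the image itself is
# passed via multi_modal_data.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image = Image.open("test.jpg")  # hypothetical test image path
outputs = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}}, sampling)
print(outputs[0].outputs[0].text)

The biggest lever is max_pixels: fewer image tokens means a shorter prefill, and prefill dominates single-image latency.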