How to speed up inference?

#4 opened by vegasscientific

I tried this on 2x A6000 48 GB and it takes around 35s for a test image. On an H100 80 GB it still takes 25s per image. Is there a vLLM configuration or other example that gives faster inference? The Qwen API server is much faster - what configuration do they use?

In the example code they have:

You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.

from transformers import AutoProcessor

# Each visual token covers a 28x28 pixel patch, so this is a budget of 256-1280 tokens per image.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels)
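If you only want to shrink some inputs, you can also cap the budget per image in the message itself instead of globally on the processor. A minimal sketch, assuming qwen_vl_utils is installed and that your version accepts the per-image min_pixels/max_pixels keys described on the model card; the file path and the 1024-token cap are placeholders, and it reuses the processor created above:

from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        # Per-image override of the visual token budget (placeholder values).
        {"type": "image", "image": "file:///path/to/test.jpg",
         "min_pixels": 256 * 28 * 28, "max_pixels": 1024 * 28 * 28},
        {"type": "text", "text": "Describe this image."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")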

Setting a smaller image size speeds it up, since the model processes fewer visual tokens per image.
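On the vLLM question: here is a minimal sketch of what I'd try with the offline LLM API, combining tensor parallelism across the two A6000s with the same capped visual token budget. Argument names like mm_processor_kwargs and limit_mm_per_prompt depend on your vLLM version, the chat-format prompt and file path are placeholders, and I don't know what configuration the hosted Qwen API actually uses, so treat this as a starting point rather than a known-good config.

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=2,            # split the 72B weights across both GPUs
    gpu_memory_utilization=0.90,
    limit_mm_per_prompt={"image": 1},  # one image per prompt
    # Same idea as min_pixels/max_pixels above: cap the visual token budget.
    mm_processor_kwargs={"min_pixels": 256 * 28 * 28, "max_pixels": 1280 * 28 * 28},
)

# Qwen2.5-VL chat format with a single image placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("/path/to/test.jpg")}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)

Passing a list of such requests to a single llm.generate() call should help further, since vLLM batches them together instead of running images one at a time.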
