Can anyone use vLLM, or any other engine that supports dynamic batching, to run this with more than 1 GPU?

#1
by bash99 - opened

I can run this with the example Python code.

But vLLM always complains with "ValueError: The input size is not aligned with the quantized weight shape.", and there seems to be no solution on the vLLM side:
https://github.com/vllm-project/vllm/issues/5675
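
For reference, this is roughly how I'm launching it. Treat it as a sketch of my setup rather than an exact reproduction: the model path is a placeholder, and `tensor_parallel_size=2` is where the shape-alignment error shows up for me.

```python
from vllm import LLM, SamplingParams

# Rough sketch of my launch script; model path and parallel size are placeholders.
llm = LLM(
    model="path/to/this-gptq-model",   # hypothetical local path to the quantized checkpoint
    quantization="gptq",
    tensor_parallel_size=2,            # multi-GPU tensor parallelism triggers the ValueError
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```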

Or could I convert it to GPTQ with a group size of 64? I'm not sure whether vLLM supports that.
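
If that route is viable, I'd quantize it myself with AutoGPTQ along these lines. This is an untested sketch: the paths and the single calibration example are placeholders, and the only real change from the usual recipe is `group_size=64`.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "path/to/original-fp16-model"   # placeholder
out_path = "path/to/gptq-g64-model"          # placeholder

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=64,     # instead of the more common 128
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)

# A real run needs a proper calibration set; this single example only shows
# the expected input format (dicts with input_ids / attention_mask).
examples = [tokenizer("Calibration text goes here.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized(out_path, use_safetensors=True)
tokenizer.save_pretrained(out_path)
```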

I've found https://qwen.readthedocs.io/en/latest/quantization/gptq.html; in the Troubleshooting section it says you can pad the original model and then quantize it, which requires a very large amount of memory. Why not just release one that works?
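
As I understand that troubleshooting note, the padding step would look roughly like the sketch below: zero-pad the MLP intermediate dimension up to a multiple of the group size, then quantize the padded checkpoint. This is my reading of the docs, not their exact script; the paths are placeholders, and it loads the full unquantized model, which is where the memory cost comes from. Padding with zeros should be safe because the extra gate/up rows produce zero activations, so the padded channels contribute nothing to the output.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "path/to/original-fp16-model"   # placeholder
dst = "path/to/padded-model"          # placeholder
group_size = 128                      # whatever group size the quantization will use

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(src)

inter = model.config.intermediate_size
pad = (-inter) % group_size           # extra rows/columns needed for divisibility

if pad:
    sd = model.state_dict()
    for name, w in sd.items():
        if name.endswith(("mlp.gate_proj.weight", "mlp.up_proj.weight")):
            # shape (intermediate, hidden): append zero output rows
            sd[name] = F.pad(w, (0, 0, 0, pad))
        elif name.endswith("mlp.down_proj.weight"):
            # shape (hidden, intermediate): append zero input columns
            sd[name] = F.pad(w, (0, pad))
    model.config.intermediate_size = inter + pad
    model.save_pretrained(dst, state_dict=sd)
    tokenizer.save_pretrained(dst)
```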
