Can anyone use vLLM, or any other engine that supports dynamic batching, to run this with more than 1 GPU?

#1
by bash99 - opened

I can run this with the example Python code.

But vLLM always complains with "ValueError: The input size is not aligned with the quantized weight shape.", and there seems to be no solution on the vLLM side:
https://github.com/vllm-project/vllm/issues/5675
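
For reference, this is roughly how I'm launching it. Treat it as a sketch of my setup rather than an exact reproduction: the model path is a placeholder, and `tensor_parallel_size=2` is where the shape-alignment error shows up for me.

```python
from vllm import LLM, SamplingParams

# Rough sketch of my launch script; model path and parallel size are placeholders.
llm = LLM(
    model="path/to/this-gptq-model",   # hypothetical local path to the quantized checkpoint
    quantization="gptq",
    tensor_parallel_size=2,            # multi-GPU tensor parallelism triggers the ValueError
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```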

Or could I convert it to GPTQ with a group size of 64? I'm not sure whether vLLM supports that.
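
If that route is viable, I'd quantize it myself with AutoGPTQ along these lines. This is an untested sketch: the paths and the single calibration example are placeholders, and the only real change from the usual recipe is `group_size=64`.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "path/to/original-fp16-model"   # placeholder
out_path = "path/to/gptq-g64-model"          # placeholder

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=64,     # instead of the more common 128
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)

# A real run needs a proper calibration set; this single example only shows
# the expected input format (dicts with input_ids / attention_mask).
examples = [tokenizer("Calibration text goes here.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized(out_path, use_safetensors=True)
tokenizer.save_pretrained(out_path)
```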

I've found https://qwen.readthedocs.io/en/latest/quantization/gptq.html; in the Troubleshooting section it says you can pad the original model and then quantize it, which requires a very large amount of memory. Why not just release one that works?
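
As I understand that troubleshooting note, the padding step would look roughly like the sketch below: zero-pad the MLP intermediate dimension up to a multiple of the group size, then quantize the padded checkpoint. This is my reading of the docs, not their exact script; the paths are placeholders, and it loads the full unquantized model, which is where the memory cost comes from. Padding with zeros should be safe because the extra gate/up rows produce zero activations, so the padded channels contribute nothing to the output.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "path/to/original-fp16-model"   # placeholder
dst = "path/to/padded-model"          # placeholder
group_size = 128                      # whatever group size the quantization will use

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(src)

inter = model.config.intermediate_size
pad = (-inter) % group_size           # extra rows/columns needed for divisibility

if pad:
    sd = model.state_dict()
    for name, w in sd.items():
        if name.endswith(("mlp.gate_proj.weight", "mlp.up_proj.weight")):
            # shape (intermediate, hidden): append zero output rows
            sd[name] = F.pad(w, (0, 0, 0, pad))
        elif name.endswith("mlp.down_proj.weight"):
            # shape (hidden, intermediate): append zero input columns
            sd[name] = F.pad(w, (0, pad))
    model.config.intermediate_size = inter + pad
    model.save_pretrained(dst, state_dict=sd)
    tokenizer.save_pretrained(dst)
```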
