When will the GLM-4/Z1 series model support VLLM?
#6 opened by David3698
EngineCore hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/torch/_dynamo/utils.py", line 2586, in run_node
return node.target(*args, **kwargs)
TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
Check this: compile vLLM manually from the PR specified below, because the fix has not yet been merged.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/16618/head:pr-16618
git checkout pr-16618
VLLM_USE_PRECOMPILED=1 pip install --editable .
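As an optional sanity check, you can confirm that the editable install points at the checked-out PR branch; this is just a sketch run from inside the vllm clone created above:
# Show the commit currently checked out (should be the head of pr-16618)
git log -1 --oneline
# Confirm Python imports the editable build from this directory
python -c "import vllm; print(vllm.__version__, vllm.__file__)"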
After the build completes, inference with the model works normally. Note that you must set the environment variable VLLM_USE_V1=0 to avoid garbled output. Here are some example scripts:
GLM-4-0414.sh
CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /root/models/GLM-4-32B-0414 \
--served-model-name ChainBlock-Turbo \
--gpu-memory-utilization 0.95 \
--max-model-len 32768 \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 9997
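Once this server is up, it exposes vLLM's OpenAI-compatible API. A minimal request sketch, assuming the host, port, and served model name from the script above (the prompt is only illustrative):
# Query the GLM-4 server started by GLM-4-0414.sh
curl http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ChainBlock-Turbo",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128
      }'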
GLM-Z1-0414.sh
CUDA_VISIBLE_DEVICES=2,3 VLLM_USE_V1=0 vllm serve /root/models/GLM-Z1-32B-0414 \
--served-model-name ChainBlock-Turbo-Reasoning \
--gpu-memory-utilization 0.95 \
--max-model-len 32768 \
--enable-reasoning --reasoning-parser granite \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 9998
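With --enable-reasoning and a reasoning parser configured, vLLM's OpenAI-compatible responses typically separate the model's thinking trace (a reasoning_content field in each choice's message) from the final answer (the content field). A minimal request sketch against the server above, reusing its port and served model name (the prompt is only illustrative):
# Query the GLM-Z1 reasoning server started by GLM-Z1-0414.sh
curl http://localhost:9998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ChainBlock-Turbo-Reasoning",
        "messages": [{"role": "user", "content": "What is 17 * 23?"}],
        "max_tokens": 512
      }'
# Inspect choices[0].message: "reasoning_content" holds the thinking trace,
# "content" holds the final answer.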