DeepSeek R1 AWQ
AWQ of the DeepSeek R1 model.
This quant includes modifications to the model code that fix an overflow issue that occurs when running inference in float16.
Tested on vLLM with 8x H100: inference speed is 5 tokens per second with batch size 1 and a short prompt, and 12 tokens per second when using the `moe_wna16` kernel.
If you are serving with vLLM, either add `--dtype float16` or use the new `moe_wna16` kernel by passing `--quantization moe_wna16`.
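
As a rough illustration, the commands below sketch what those two options might look like with vLLM's `vllm serve` entry point. The repo name matches this model card; `--tensor-parallel-size 8` (mirroring the 8x H100 setup above) and `--trust-remote-code` (assumed to be needed because the repo contains custom model code) are illustrative assumptions, not instructions from the original card.

```bash
# Option 1: plain AWQ path; float16 avoids the overflow this repo's code change addresses.
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --dtype float16

# Option 2: the faster moe_wna16 kernel mentioned above.
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --quantization moe_wna16
```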
Base model: deepseek-ai/DeepSeek-R1