DeepSeek-R1-Distill-Qwen-14B-FP8
FP8-quantized version of DeepSeek-R1-Distill-Qwen-14B, optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%.
Model Overview
- Base Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
- Quantization: FP8 (weights and activations)
- Memory Reduction: ~50% (from 16-bit to 8-bit; see the rough estimate below)
- License: MIT License (following the original model's license)
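As a rough sanity check of the ~50% figure, halving the bytes per parameter roughly halves weight memory. The sketch below counts weights only and ignores KV cache, activations, and runtime overhead; the parameter count is approximate:

```python
# Rough weight-memory estimate for a ~14B-parameter model (approximate count).
params = 14.8e9
print(f"16-bit weights: {params * 2 / 1024**3:.0f} GiB")  # ~28 GiB
print(f"FP8 weights:    {params * 1 / 1024**3:.0f} GiB")  # ~14 GiB
```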
Compression Details
Compressed using LLM Compressor with:
- 512 calibration samples from UltraChat
- Symmetric per-tensor quantization
- Applied to linear operators within transformer blocks
The compression script is available in compress.py; a sketch of the general approach is shown below.
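For reference, here is a minimal sketch of how such an FP8 run is typically set up with LLM Compressor, following its standard static-FP8 calibration example. The dataset name, sequence length, and output directory below are illustrative assumptions; the exact recipe and preprocessing used for this checkpoint are in compress.py.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
SAVE_DIR = "DeepSeek-R1-Distill-Qwen-14B-FP8"   # illustrative output directory
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048                       # assumed; not stated in this card

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration samples from UltraChat, rendered through the chat template
# and tokenized before calibration.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda x: {"text": tokenizer.apply_chat_template(x["messages"], tokenize=False)})
ds = ds.map(
    lambda x: tokenizer(x["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Static FP8 (symmetric, per-tensor) for weights and activations of the Linear
# layers inside the transformer blocks; the lm_head is left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```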
Requirements
- vLLM (see the inference example below)
- transformers
- torch
- accelerate
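With these installed, the checkpoint can be loaded directly in vLLM, which reads the FP8 (compressed-tensors) configuration from the model files. A minimal, untested sketch; the sampling settings are placeholders, not tuned recommendations (see the note below):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; vLLM detects the quantization from the model config.
llm = LLM(model="enferAI/DeepSeek-R1-Distill-Qwen-14B-FP8", max_model_len=4096)

# Placeholder sampling settings; optimal values have not been tested for this model.
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

outputs = llm.generate(["Explain, step by step, why the sky is blue."], sampling)
print(outputs[0].outputs[0].text)
```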
Note
This is an experimental compression of the model. Performance metrics and optimal usage parameters have not been thoroughly tested yet.