DeepSeek-R1-Distill-Qwen-14B-FP8

An FP8-quantized version of DeepSeek-R1-Distill-Qwen-14B, optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%.

Model Overview

  • Base Model: DeepSeek-R1-Distill-Qwen-14B
  • Quantization: FP8 (weights and activations)
  • Memory Reduction: ~50% (from 16-bit to 8-bit)
  • License: MIT License (following the original model's license)

Compression Details

Compressed using LLM Compressor with:

  • 512 calibration samples from UltraChat
  • Symmetric per-tensor quantization
  • Applied to linear operators within transformer blocks

The compression script is available in compress.py.
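For reference, the snippet below is a minimal sketch of what such a script typically looks like with LLM Compressor's one-shot FP8 flow. The dataset id (HuggingFaceH4/ultrachat_200k), sequence length, and output directory are assumptions for illustration and may differ from the actual compress.py.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
NUM_CALIBRATION_SAMPLES = 512   # matches the 512 UltraChat samples above
MAX_SEQUENCE_LENGTH = 2048      # assumed value

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Build the calibration set from UltraChat (assumed dataset id)
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda s: tokenizer(s["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# FP8 (symmetric, per-tensor) quantization of the Linear layers inside transformer blocks
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

SAVE_DIR = "DeepSeek-R1-Distill-Qwen-14B-FP8"  # assumed output path
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```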

Requirements

  • vLLM
  • transformers
  • torch
  • accelerate
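
A minimal usage sketch with vLLM is shown below; the sampling settings are illustrative, not tuned recommendations.

```python
from vllm import LLM, SamplingParams

# vLLM picks up the FP8 quantization config from the checkpoint files
llm = LLM(model="enferAI/DeepSeek-R1-Distill-Qwen-14B-FP8")

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain FP8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```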

Note

This is an experimental compression of the model. Performance has not been benchmarked, and optimal inference settings have not yet been established.
