Amazon SageMaker deployment failing with CUDA OutOfMemory error
Error Message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU
rank=0
Code (taken from the example snippet shown when you click Deploy):
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

hub = {
    'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-70B',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.2.0"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)
TGI launcher arguments from the logs:
text_generation_launcher: Args {
model_id: "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: Some(
1,
),
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "container-0.local",
port: 8080,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/tmp",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
disable_usage_stats: false,
disable_crash_reports: false,
}
Tried ml.g5.2xlarge, ml.g5.8xlarge, and ml.p4d.24xlarge; I get the same error on all of them.
Try this. We shipped a new DLC with TGI v3 that is not yet referenced in the SageMaker SDK, so you have to pass the image URI explicitly. Also note that your logs show num_shard: Some(1): with SM_NUM_GPUS set to 1, TGI shards the model across a single GPU no matter how many the instance has, which is why even ml.p4d.24xlarge OOMs. Set SM_NUM_GPUS to the instance's GPU count, and remember to change the region in the image URI!
It should fit on ml.g6.12xlarge and above.
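You can see why the single-GPU setup fails with some back-of-envelope math (a rough sketch; the 2 bytes/param figure assumes bf16/fp16 weights):

n_params = 70e9
bytes_per_param = 2  # bf16/fp16
weights_gb = n_params * bytes_per_param / 1024**3
print(f"~{weights_gb:.0f} GB just for the weights")  # ~130 GB

# With SM_NUM_GPUS=1, TGI places everything on a single GPU:
#   ml.g5.2xlarge   -> 1x A10G (24 GB)
#   ml.p4d.24xlarge -> 8x A100 (40 GB each), but num_shard=1 uses only one
# ~130 GB >> 24-40 GB, hence the OOM on every instance tried.

With that in mind, here is the updated snippet: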
import time
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

# TGI v3 DLC -- swap us-east-1 in the URI for your own region
# (the account ID can also differ in a few regions)
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi3.0.1-gpu-py311-cu124-ubuntu22.04"

model_name = "deepseek-ai-deepseek-r1-distill-llama-70b" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-70B',
    'SM_NUM_GPUS': json.dumps(8),  # ml.g6.48xlarge has 8 GPUs
    'MESSAGES_API_ENABLED': "true",
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.48xlarge",
    endpoint_name=model_name,
)
Thanks for flagging. We'll update the deploy snippet :)