Amazon SageMaker deployment failing with CUDA OutOfMemory error
Error Message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU
rank=0
Code (taken from the example snippet shown when you click Deploy):
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

hub = {
    'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-70B',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.2.0"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)
TGI launcher arguments from the logs:
text_generation_launcher: Args {
model_id: "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: Some(
1,
),
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "container-0.local",
port: 8080,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/tmp",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
disable_usage_stats: false,
disable_crash_reports: false,
}
Tried ml.g5.2xlarge, ml.g5.8xlarge, and ml.p4d.24xlarge; I get the same error on all of them.
Try this. We shipped a new DLC with TGI v3 that is not yet referenced in the SageMaker SDK, so you have to pass the image URI explicitly. Also note that your logs show num_shard: Some(1): with SM_NUM_GPUS set to 1, TGI shards the model across a single GPU no matter how many the instance has, which is why even ml.p4d.24xlarge OOMs. Set SM_NUM_GPUS to the instance's GPU count, and remember to change the region in the image URI!
It should fit on ml.g6.12xlarge and above.
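You can see why the single-GPU setup fails with some back-of-envelope math (a rough sketch; the 2 bytes/param figure assumes bf16/fp16 weights):

n_params = 70e9
bytes_per_param = 2  # bf16/fp16
weights_gb = n_params * bytes_per_param / 1024**3
print(f"~{weights_gb:.0f} GB just for the weights")  # ~130 GB

# With SM_NUM_GPUS=1, TGI places everything on a single GPU:
#   ml.g5.2xlarge   -> 1x A10G (24 GB)
#   ml.p4d.24xlarge -> 8x A100 (40 GB each), but num_shard=1 uses only one
# ~130 GB >> 24-40 GB, hence the OOM on every instance tried.

With that in mind, here is the updated snippet: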
import time
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

# TGI v3 DLC -- swap us-east-1 in the URI for your own region
# (the account ID can also differ in a few regions)
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi3.0.1-gpu-py311-cu124-ubuntu22.04"

model_name = "deepseek-ai-deepseek-r1-distill-llama-70b" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-70B',
    'SM_NUM_GPUS': json.dumps(8),  # ml.g6.48xlarge has 8 GPUs
    'MESSAGES_API_ENABLED': "true",
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.48xlarge",
    endpoint_name=model_name,
)
Thanks for flagging. We'll update the deploy snippet :)