CUDA error when initialising model with text-generation-inference

#22
by AbRds - opened

Hi everyone, I was trying to deploy the model using the text-generation-inference toolkit on an AWS EC2 g5.24xlarge with 96 GB of total GPU memory (4 GPUs of 24 GB each), but while the model is initialising I receive the following message (once per GPU):

--torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 1 has a total capacty of 21.99 GiB of which 77.00 MiB is free. Process 27730 has 21.90 GiB memory in use. Of the allocated memory 21.47 GiB is allocated by PyTorch, and 42.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF--

I have used this command to launch the service:

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v /dev/data:/data ghcr.io/huggingface/text-generation-inference:1.3.0 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 4 --max-batch-total-tokens 1024000 --max-total-tokens 32000

It seems like PyTorch is reserving GPU memory and causing the model load to fail, but I don't know how to address this issue. Can somebody help me understand the problem or figure out a way around it?

Thanks in advance.

I would try quantizing the model following this: https://huggingface.co./docs/text-generation-inference/conceptual/quantization or running it in float16. I'm not super familiar with TGI, but you might also need more memory for the max batch total tokens value you are using.
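
Not tested on my side, but a quantized launch could look roughly like the command below (using bitsandbytes-nf4 as one example; eetq, awq and gptq are other options listed in those docs, and the exact flags supported depend on your TGI version, so please double-check against your release):

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v /dev/data:/data ghcr.io/huggingface/text-generation-inference:1.3.0 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 4 --quantize bitsandbytes-nf4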

What worked for me was to enable device_map="auto".
So in the line where you load the model, change it to:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

This makes the model use all 4 GPUs.

Hi, thanks for the response. How can I apply this change when using TGI?

Thanks in advance.

Hi, I was finally able to run the model with TGI by using an in-place quantisation technique (I assume my current setup is not enough to run the model unquantised). I also used the default value for the --max-total-tokens flag.

Here is the command I used in case it is useful for someone else:

sudo docker run -d --gpus all --shm-size 1g -p $port:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --sharded true --num-shard 4 --quantize eetq
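
To verify the deployment once the container is running, you can send a test request to TGI's /generate endpoint, something like the following (adjust the port to whatever $port maps to on your machine):

curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' -d '{"inputs": "Hello, who are you?", "parameters": {"max_new_tokens": 50}}'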
AbRds changed discussion status to closed

Glad you made it work
