CUDA error when initialising model with text-generation-inference

#22

by AbRds - opened Dec 13, 2023

Dec 13, 2023

•

edited Dec 13, 2023

Hi everyone, I was trying to deploy the model using the text-generation-inference toolkit in a AWS EC2 G5.24xLarge with 96GB of GPU. (4 GPUs of 24GB each), but when the model is initialising I receive the following message (once per each GPU):

--torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 1 has a total capacty of 21.99 GiB of which 77.00 MiB is free. Process 27730 has 21.90 GiB memory in use. Of the allocated memory 21.47 GiB is allocated by PyTorch, and 42.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF--

I have used this command to launch the service:

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v /dev/data:/data ghcr.io/huggingface/text-generation-inference:1.3.0 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 4 --max-batch-total-tokens 1024000 --max-total-tokens 32000

Seems like pythorch is reserving GPU memory causing a failure in the load of the model but I don't know how to face this issue. Somebody can help me to understand the problem or how to figure it out?

Thanks in advance.

ArthurZ

Dec 18, 2023

Would try to quantize the model with this: https://huggingface.co./docs/text-generation-inference/conceptual/quantization or run it in float16. Not super familiar with TGI but you might need more memory for the max batch total token you are using

steilgedacht

Dec 18, 2023

What worked for me was to enable device_map="auto".
So in the line where you load the model change it to
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

This makes the model use all 4 GPUs

AbRds

Dec 18, 2023

What worked for me was to enable device_map="auto".
So in the line where you load the model change it to
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

This makes the model use all 4 GPUs

Hi, thanks for the response, how can I apply this change using the TGI?

thanks in advance.

ArthurZ

Dec 18, 2023

cc @Narsil for TGI!

AbRds

Dec 19, 2023

•

edited Dec 19, 2023

Hi, finally I was able to run the model along with TGI using an in-place quantisation technique (I've supposed my current setup is not enough to run the model), also I used the default value for the flag --max-total-tokens.

Here is the command I used in case it is useful for someone else:

sudo docker run -d --gpus all --shm-size 1g -p $port:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --sharded true --num-shard 4 --quantize eetq

AbRds changed discussion status to closed Dec 19, 2023

Narsil

Dec 21, 2023

Glad you made it work

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment