Getting the model to run?
Hello all, I'm wondering if any of you had to do something special to get the model to run. I have 2x H100 GPUs, each with 80 GB of VRAM, but I keep getting a CUDA out-of-memory error when loading the model:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacity of 79.32 GiB of which 964.44 MiB is free. Process 18230 has 78.37 GiB memory in use. Of the allocated memory 77.79 GiB is allocated by PyTorch, and 511.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(raised when instantiating TELayerNormColumnParallelLinear, when instantiating MLP, when instantiating TransformerLayer)
```
Setting PYTORCH_CUDA_ALLOC_CONF as it suggests does not help.
Has anyone run across the same issue?
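For reference, a plain transformers load sharded over both GPUs would look something like the sketch below. The model ID is my guess at the HF-format checkpoint (substitute whichever Nemotron checkpoint you are loading), and note that the traceback above actually comes from the NeMo/Transformer Engine loading path, so this is only an approximation of my setup:

```python
# Minimal sketch, not my exact script: set the allocator config before
# the first CUDA allocation, then let accelerate shard the weights
# across both H100s instead of packing everything onto GPU 0.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: an HF-format checkpoint; substitute the model you are loading.
model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~140 GiB of weights in bf16, too big for one 80 GiB GPU
    device_map="auto",           # requires accelerate; splits layers across GPU 0 and GPU 1
)
```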
Here is my environment:
- huggingface_hub version: 0.23.4
- Platform: Linux-4.18.0-553.16.1.el8_10.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.12
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /opt/modeling/vkg/Llama-3.1-Nemotron/data/token
- Has saved token ?: True
- Who am I ?: vkg
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.0a0+40ec155e58.nv24.3
- Jinja2: 3.1.3
- Graphviz: 0.20.3
- keras: N/A
- Pydot: N/A
- Pillow: 10.2.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.24.4
- pydantic: 2.8.2
- aiohttp: 3.9.3
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /opt/modeling/vkg/Llama-3.1-Nemotron/data/hub
- HF_ASSETS_CACHE: /opt/modeling/vkg/Llama-3.1-Nemotron/data/assets
- HF_TOKEN_PATH: /opt/modeling/vkg/Llama-3.1-Nemotron/data/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
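For what it's worth, a quick check like the sketch below prints free/total memory per visible device, which should confirm whether the load is landing entirely on GPU 0 while GPU 1 sits idle:

```python
# Sketch: report free/total memory for each visible GPU.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```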
Thanks in advance.
It took me a month to figure out how to load its cousin (the reward model), and it only worked with a Docker image for the Triton Inference Server.
Take a look at this issue: https://github.com/NVIDIA/NeMo-Aligner/issues/351