High GPU Memory Usage with NVIDIA NV-Embed-v2 Compared to Other Models
Context
I am using the `nvidia/NV-Embed-v2` embedding model from Hugging Face within a Retrieval-Augmented Generation (RAG) pipeline. The application loads documents, processes them into embeddings, and uses the embeddings for retrieval tasks. The same code works perfectly with other Hugging Face models, such as `BAAI/bge-m3`, which only uses about 3.4 GB of VRAM.
However, when I use `nvidia/NV-Embed-v2`, the VRAM usage spikes to the maximum capacity of my GPU, leading to a CUDA Out of Memory (OOM) error.
System Details
- GPUs:
- GPU 0: NVIDIA RTX 4090 (24 GB VRAM)
- GPU 1: NVIDIA T400 (4 GB VRAM)
- OS: Linux (Rocky Linux 9)
- Python Version: 3.11
- PyTorch Version: 2.2.0
- Transformers Version: 4.42.4
- LangChain: Latest Version
- Model: `nvidia/NV-Embed-v2`
The Problem
When using `nvidia/NV-Embed-v2`:
- VRAM usage on GPU 0 (RTX 4090) immediately jumps to its maximum capacity (~24 GB).
- I receive the following error:
`torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB.`
Observations:
- GPU 1 (T400) remains largely unused due to its low VRAM (4 GB), causing an imbalance.
- The same pipeline with `BAAI/bge-m3` works flawlessly, consuming only ~3.4 GB of VRAM.
- I load only one document: a PDF of about 270 KB, 6 pages of text.
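For reference, GPU memory can also be cross-checked from inside the process with a short snippet like the one below (illustrative only; the figures quoted above and in the logs at the end come from `nvidia-smi`):

```python
import torch

# Cross-check GPU 0 memory from inside the Python process.
allocated = torch.cuda.memory_allocated(0) / 1024**3   # tensors currently allocated by PyTorch
reserved = torch.cuda.memory_reserved(0) / 1024**3     # blocks held by PyTorch's caching allocator
free, total = torch.cuda.mem_get_info(0)                # device-wide view, comparable to nvidia-smi
print(f"allocated={allocated:.1f} GiB  reserved={reserved:.1f} GiB  "
      f"free={free / 1024**3:.1f}/{total / 1024**3:.1f} GiB")
```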
Code Overview
```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import ApertureDB
from PyPDF2 import PdfReader
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Embedding Model
embeddings = HuggingFaceEmbeddings(
model_name="nvidia/NV-Embed-v2",
model_kwargs={"trust_remote_code": True}
)
# Process Documents
all_docs = []
pdf_reader = PdfReader("example.pdf")
docs = [Document(page_content=page.extract_text() or "") for page in pdf_reader.pages]
all_docs.extend(docs)
# Split and Add to Vector Store
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=300)
segmented_docs = text_splitter.split_documents(all_docs)
vector_db = ApertureDB(embeddings=embeddings, descriptor_set="example_vector_db")
vector_db.add_documents(segmented_docs)
```
What I Have Tried
- Reduced the batch size: set `batch_size=1` to minimize memory usage.
- Used `device_map="balanced_low_0"`: attempted to offload parts of the model to the CPU.
- Cleared the GPU memory cache: called `torch.cuda.empty_cache()` before and after model loading.

Despite all of this (combined roughly as in the sketch below), GPU 0's VRAM remains fully utilized, leading to the OOM error.
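This is roughly how the settings above were wired together. The nested `model_kwargs` forwarding is my assumption about how recent `sentence-transformers` versions pass arguments through to `AutoModel.from_pretrained`, so treat it as a sketch rather than a verified configuration:

```python
import torch
from langchain_community.embeddings import HuggingFaceEmbeddings

torch.cuda.empty_cache()  # clear cached blocks before loading

embeddings = HuggingFaceEmbeddings(
    model_name="nvidia/NV-Embed-v2",
    model_kwargs={
        "trust_remote_code": True,
        # Assumption: sentence-transformers forwards this inner dict to
        # AutoModel.from_pretrained, where device_map is understood.
        "model_kwargs": {"device_map": "balanced_low_0"},
    },
    encode_kwargs={"batch_size": 1},  # encode one chunk at a time
)

torch.cuda.empty_cache()  # clear again after the model is loaded
```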
Questions
- Why does `nvidia/NV-Embed-v2` consume significantly more VRAM than other models like `BAAI/bge-m3`?
- Is there a recommended way to load this model with optimized VRAM usage (e.g., offloading to CPU or reducing layers)?
- How can I effectively balance the workload across the two GPUs (RTX 4090 and T400) while handling their VRAM constraints? A sketch of the kind of configuration I have in mind follows these questions.
- Are there specific configurations for NV-Embed-v2 that reduce its VRAM footprint for embedding tasks?
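For questions 2 and 3, this is the direction I am considering: loading the model directly with `transformers`/`accelerate`, capping per-device memory, and using half precision. I am not sure whether this interacts well with the model's `trust_remote_code` implementation, and the memory limits below are only placeholders:

```python
import torch
from transformers import AutoModel

# Sketch: cap what each GPU may hold and let accelerate offload the rest to CPU.
# The max_memory values are illustrative placeholders, not tuned numbers.
model = AutoModel.from_pretrained(
    "nvidia/NV-Embed-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,   # half precision to roughly halve the weight footprint
    device_map="auto",           # let accelerate decide layer placement
    max_memory={0: "22GiB", 1: "3GiB", "cpu": "48GiB"},
)
```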
System Logs
Here is an example of my `nvidia-smi` output before the error when using `nvidia/NV-Embed-v2`:
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:16:00.0 Off | Off |
| 0% 37C P8 15W / 480W | 24084MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA T400 4GB Off | 00000000:AC:00.0 On | N/A |
| 38% 33C P8 N/A / 31W | 12MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```
And here is the output when using `BAAI/bge-m3`:
```
+-------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|===================================================================|
| 0 N/A N/A 310496 C python 3430MiB |
+-------------------------------------------------------------------+
```
Thank you for your support and insights!