High GPU Memory Usage with NVIDIA NV-Embed-v2 Compared to Other Models
Context
I am using the `nvidia/NV-Embed-v2` embedding model from Hugging Face within a Retrieval-Augmented Generation (RAG) pipeline. The application loads documents, processes them into embeddings, and uses the embeddings for retrieval tasks. The same code works perfectly with other Hugging Face models, such as `BAAI/bge-m3`, which only uses about 3.4 GB of VRAM.
However, when I use `nvidia/NV-Embed-v2`, the VRAM usage spikes to the maximum capacity of my GPU, leading to a CUDA Out of Memory (OOM) error.
System Details
- GPUs:
- GPU 0: NVIDIA RTX 4090 (24 GB VRAM)
- GPU 1: NVIDIA T400 (4 GB VRAM)
- OS: Linux (Rocky Linux 9)
- Python Version: 3.11
- PyTorch Version: 2.2.0
- Transformers Version: 4.42.4
- LangChain: Latest Version
- Model: `nvidia/NV-Embed-v2`
The Problem
When using `nvidia/NV-Embed-v2`:
- VRAM usage on GPU 0 (RTX 4090) immediately jumps to its maximum capacity (~24 GB).
- I receive the following error:
`torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB.`
Observations:
- GPU 1 (T400) remains largely unused due to its low VRAM (4 GB), causing an imbalance.
- The same pipeline with `BAAI/bge-m3` works flawlessly, consuming only ~3.4 GB of VRAM.
- I load only one document: a PDF of about 270 KB, 6 pages of text.
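For reference, GPU memory can also be cross-checked from inside the process with a short snippet like the one below (illustrative only; the figures quoted above and in the logs at the end come from `nvidia-smi`):

```python
import torch

# Cross-check GPU 0 memory from inside the Python process.
allocated = torch.cuda.memory_allocated(0) / 1024**3   # tensors currently allocated by PyTorch
reserved = torch.cuda.memory_reserved(0) / 1024**3     # blocks held by PyTorch's caching allocator
free, total = torch.cuda.mem_get_info(0)                # device-wide view, comparable to nvidia-smi
print(f"allocated={allocated:.1f} GiB  reserved={reserved:.1f} GiB  "
      f"free={free / 1024**3:.1f}/{total / 1024**3:.1f} GiB")
```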
Code Overview
```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import ApertureDB
from PyPDF2 import PdfReader
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Embedding Model
embeddings = HuggingFaceEmbeddings(
model_name="nvidia/NV-Embed-v2",
model_kwargs={"trust_remote_code": True}
)
# Process Documents
all_docs = []
pdf_reader = PdfReader("example.pdf")
docs = [Document(page_content=page.extract_text() or "") for page in pdf_reader.pages]
all_docs.extend(docs)
# Split and Add to Vector Store
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=300)
segmented_docs = text_splitter.split_documents(all_docs)
vector_db = ApertureDB(embeddings=embeddings, descriptor_set="example_vector_db")
vector_db.add_documents(segmented_docs)
```
What I Have Tried
- Reduced the batch size: set `batch_size=1` to minimize memory usage.
- Used `device_map="balanced_low_0"`: attempted to offload parts of the model to the CPU.
- Cleared the GPU memory cache: called `torch.cuda.empty_cache()` before and after model loading.

Despite all of this (combined roughly as in the sketch below), GPU 0's VRAM remains fully utilized, leading to the OOM error.
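This is roughly how the settings above were wired together. The nested `model_kwargs` forwarding is my assumption about how recent `sentence-transformers` versions pass arguments through to `AutoModel.from_pretrained`, so treat it as a sketch rather than a verified configuration:

```python
import torch
from langchain_community.embeddings import HuggingFaceEmbeddings

torch.cuda.empty_cache()  # clear cached blocks before loading

embeddings = HuggingFaceEmbeddings(
    model_name="nvidia/NV-Embed-v2",
    model_kwargs={
        "trust_remote_code": True,
        # Assumption: sentence-transformers forwards this inner dict to
        # AutoModel.from_pretrained, where device_map is understood.
        "model_kwargs": {"device_map": "balanced_low_0"},
    },
    encode_kwargs={"batch_size": 1},  # encode one chunk at a time
)

torch.cuda.empty_cache()  # clear again after the model is loaded
```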
Questions
- Why does `nvidia/NV-Embed-v2` consume significantly more VRAM than other models like `BAAI/bge-m3`?
- Is there a recommended way to load this model with optimized VRAM usage (e.g., offloading to CPU or reducing layers)?
- How can I effectively balance the workload across the two GPUs (RTX 4090 and T400) while handling their VRAM constraints? A sketch of the kind of configuration I have in mind follows these questions.
- Are there specific configurations for NV-Embed-v2 that reduce its VRAM footprint for embedding tasks?
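For questions 2 and 3, this is the direction I am considering: loading the model directly with `transformers`/`accelerate`, capping per-device memory, and using half precision. I am not sure whether this interacts well with the model's `trust_remote_code` implementation, and the memory limits below are only placeholders:

```python
import torch
from transformers import AutoModel

# Sketch: cap what each GPU may hold and let accelerate offload the rest to CPU.
# The max_memory values are illustrative placeholders, not tuned numbers.
model = AutoModel.from_pretrained(
    "nvidia/NV-Embed-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,   # half precision to roughly halve the weight footprint
    device_map="auto",           # let accelerate decide layer placement
    max_memory={0: "22GiB", 1: "3GiB", "cpu": "48GiB"},
)
```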
System Logs
Here is an example of my `nvidia-smi` output before the error when using `nvidia/NV-Embed-v2`:
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:16:00.0 Off | Off |
| 0% 37C P8 15W / 480W | 24084MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA T400 4GB Off | 00000000:AC:00.0 On | N/A |
| 38% 33C P8 N/A / 31W | 12MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```
And here is the output when using `BAAI/bge-m3`:
```
+-------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|===================================================================|
| 0 N/A N/A 310496 C python 3430MiB |
+-------------------------------------------------------------------+
```
Thank you for your support and insights!