Slow inference performance when using nomic-embed-text-v1.5
Hello there,
After considering multiple aspects of this model, we decided to give it a shot over bge-large-en. The first observation is that it runs pretty slowly, even on GPU. My code looks like this:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from models.semantic_search_gen import SemanticSearchGen


class NomicEmbedText(SemanticSearchGen):
    """Implementation to use vector embedding model nomic-embed-text-v1.5"""

    def __init__(self):
        """
        Constructor to initialize the model
        """
        super().__init__("nomic-embed-text-v1.5")
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
        self.model = AutoModel.from_pretrained(self.model_path, trust_remote_code=True, safe_serialization=True)
        self.model.eval()
        self.model.max_seq_length = self.model_conf["token_limit"]
        self.token_limit = self.model.max_seq_length

    def mean_pooling(self, model_output, attention_mask):
        """
        Implementation to calculate embeddings after mean pooling
        :param model_output: Model output containing the token embeddings
        :param attention_mask: Attention mask returned by the tokenizer
        :return: Vector embeddings after mean pooling
        """
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def gen_embeddings(self, text, prompt=False, text_type="search_document"):
        """
        Generate vector embeddings
        :param text: Input text to generate the embeddings for
        :param prompt: Not applicable here
        :param text_type: Default is "search_document", which generates embeddings for a search document. If the embedding is being generated for a search query, pass "search_query" instead.
        :return: Vector embeddings as a list of float values
        """
        if text_type == "search_document":
            instruction = "search_document: "
        else:
            instruction = "search_query: "
        encoded_input = self.tokenizer(instruction + text, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = self.model(**encoded_input)
        embeddings = self.mean_pooling(model_output, encoded_input['attention_mask'])
        embeddings = F.normalize(embeddings, p=2, dim=1)
        # Convert embeddings from Tensor to NumPy and return an array of floats
        embeddings = embeddings.numpy()[0]
        return embeddings

    def get_model(self):
        return self.model
Could the slowness be due to the pooling calculation, or to not utilizing the GPU while calculating the embeddings?
Due to some conversion issue, I was not able to run it via sentence-transformers as it was giving me some torch conversion related stacktrace.
Thanks!
Hello!
I can think of two causes here:
- (Most likely) Nomic's tokenizer accepts much longer inputs than bge-large-en-v1.5: 8192 tokens instead of 512. This means the model has to process far more tokens, and most encoder models get considerably slower as inputs get longer (self-attention scales quadratically with sequence length), so this is a very likely cause. If you want to avoid the increased inference times, you might want to set
tokenizer.model_max_length = 512
and then test whether you get improved performance.
- (Less likely) bge-large-en-v1.5 via SentenceTransformers automatically does batching (32 samples per inference by default, I believe). This is quite a bit faster than doing 32 inferences of 1 sample. Both suggestions are sketched in code right below this list.
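To make that concrete, here is a rough sketch of both suggestions on the plain transformers path; the texts are placeholders and the 512 cap just mirrors bge-large-en-v1.5's limit:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
model.eval()

# 1) Cap inputs at 512 tokens like bge-large-en-v1.5; truncation=True below then respects this limit
tokenizer.model_max_length = 512

# 2) Encode several texts in one batched forward pass instead of one call per text
texts = ["search_document: " + t for t in ["first placeholder passage", "second placeholder passage"]]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)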
Also,
Due to some conversion issue, I was not able to run it via sentence-transformers as it was giving me some torch conversion related stacktrace.
This is a shame :/ Could you post the stacktrace when you run:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten']
embeddings = model.encode(sentences)
print(embeddings)
- Tom Aarsen
Hi Tom,
Thanks for pointing out the likely causes of the slowness.
I am going to try tokenizer.model_max_length = 512 and will update you shortly.
Regarding the inability to use SentenceTransformer, here is the stack trace I get whenever I use SentenceTransformer for nomic:
/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
A new version of the following files was downloaded from https://huggingface.co./nomic-ai/nomic-bert-2048:
- configuration_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./nomic-ai/nomic-bert-2048:
- modeling_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
<All keys matched successfully>
Fatal Python error: Aborted
Thread 0x00000002a72ff000 (most recent call first):
File "/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 331 in wait
File "/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 629 in wait
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1002 in _bootstrap
<Truncated some other threads to show the main culprit, which is below>
Current thread 0x00000001f59e2500 (most recent call first):
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160 in convert
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 805 in _apply
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1174 in to
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 215 in __init__
File "/Users/umesh/git-repos/genai_search/models/impl/nomic_embed_text_v1.py", line 18 in __init__
File "/Users/umesh/git-repos/genai_search/models/model_factory.py", line 28 in __getattr__
File "/Users/umesh/git-repos/genai_search/streaming_processor.py", line 315 in apply_cleanup_and_store
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/util.py", line 81 in wrapper
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 1745 in processPartition
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 828 in func
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 820 in process
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 830 in main
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 74 in worker
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193 in manager
File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 218 in <module>
File "<frozen runpy>", line 88 in _run_code
File "<frozen runpy>", line 198 in _run_module_as_main
pip list | grep -ia transformer
sentence-transformers 2.6.1
transformers 4.37.0
Thanks!
Here is the torch version:
pip list | grep torch
torch 2.4.0
torchvision 0.19.0
Unfortunately, setting tokenizer.model_max_length = 512 doesn't seem to give any performance boost in my case.
BTW, thanks for asking about the SentenceTransformer issue. I revisited it and was able to fix it by passing the device parameter. Strangely, the above-mentioned issue in the thread dump is gone when using the lines below; earlier I was not passing the device param:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model = SentenceTransformer(self.model_path, trust_remote_code=True, device=device)
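In case it is useful, here is a rough sketch of how I am calling it now (placeholder texts; batch_size=32 is, I believe, what encode() uses by default anyway):

import torch
from sentence_transformers import SentenceTransformer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True, device=device)

# encode() batches internally and can L2-normalize the embeddings for cosine similarity
texts = ["search_document: first placeholder passage", "search_document: second placeholder passage"]
embeddings = model.encode(texts, batch_size=32, normalize_embeddings=True)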
Also, I must say SentenceTransformer is way faster than plain Transformers; I still don't know why ;)
Thanks!
From the code above, it doesn't seem like you are putting either the model or the inputs on the GPU, which could explain the slowness of transformers. IIRC, Sentence Transformers handles a lot of this, especially when passing device. To verify this, you can check the output of nvidia-smi while running the transformers code; it should show some GPU usage while your code is running.
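For reference, a minimal sketch of the device handling I mean on the plain transformers side (repo id and text are placeholders; the .to(device) calls are the only real change from your snippet):

import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
model.eval()
model.to(device)  # move the model weights onto the GPU once, at load time

encoded = tokenizer("search_document: placeholder text", padding=True, truncation=True, return_tensors="pt")
encoded = {k: v.to(device) for k, v in encoded.items()}  # move the input tensors too

with torch.no_grad():
    output = model(**encoded)

# .numpy() only works on CPU tensors, so move results back before converting
token_embeddings = output[0].cpu()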