How to Optimize Slow CPU Inference Speed

#2
by izhaohui - opened

Due to certain restrictions, I cannot use a GPU during deployment or access GPU-backed APIs on other devices. When computing embeddings on the CPU, processing an 800x800 image takes several minutes, which is extremely slow. Other CLIP models I’ve tested typically complete this in around 1 second. I’m not familiar with the model’s architecture, but I’d like to know if there are ways to optimize CPU inference speed. Thanks.

After setting the attention implementation (flash attention) to None, the problem was resolved. Thanks.

izhaohui changed discussion status to closed

It feels like there is something wrong with your setup. The demo runs on CPU and completes the task in ~30-40 seconds, which is kinda expected. This model supports higher resolution - 512px compared to 256px in standard CLIP models - which results in a 4x longer input sequence for the visual branch. Performance may also be affected by long text inputs. Can you share the code and some data examples?
Generally, one good way to optimize CPU inference is to use ONNX. The conversion from PyTorch to ONNX is quite straightforward, but it can be tricky the first time you do it. I'll convert the model to ONNX over the weekend and update the repo.
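
In the meantime, here is a rough sketch of what the export could look like. It wraps the model's get_logits entry point (the one shown in the repo's usage example) in a plain forward() so it can be traced; the wrapper class, dummy input shapes, output file name, and opset version below are assumptions and may need adjusting for this model.

import torch
from transformers import AutoModel

# Hypothetical export sketch -- wrapper, dummy shapes, and opset are assumptions.
class LogitsWrapper(torch.nn.Module):
    """Expose get_logits() as a plain forward() so torch.onnx.export can trace it."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, pixel_values):
        return self.model.get_logits(input_ids, attention_mask, pixel_values)

model = AutoModel.from_pretrained(
    "visheratin/mexma-siglip2", trust_remote_code=True, attn_implementation=None
)
model.eval()

# Dummy inputs for tracing; 512x512 matches the model's stated input resolution.
dummy_ids = torch.ones(1, 16, dtype=torch.long)
dummy_mask = torch.ones(1, 16, dtype=torch.long)
dummy_pixels = torch.randn(1, 3, 512, 512)

torch.onnx.export(
    LogitsWrapper(model),
    (dummy_ids, dummy_mask, dummy_pixels),
    "mexma-siglip2.onnx",
    input_names=["input_ids", "attention_mask", "pixel_values"],
    output_names=["image_logits", "text_logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "pixel_values": {0: "batch"},
    },
    opset_version=17,
)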

CPU: Intel 10500
Memory: 32GB
Runtime: Docker
OS: Debian

from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import torch
import sys
import time

name = "visheratin/mexma-siglip2"
path = "/xxxx/models"  # local copy of the model
name = path
# Loading with attn_implementation=None is the fast configuration (case 1 below).
model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True, device_map='cpu', attn_implementation=None)
tokenizer = AutoTokenizer.from_pretrained(name)
processor = AutoImageProcessor.from_pretrained(name)

begin = time.time()
with torch.inference_mode():
    # Load and downscale the test image, then preprocess it for the vision branch.
    img = Image.open("/mnt/xxx.jpg")
    img.thumbnail((800, 800))
    img = processor(images=img, return_tensors="pt")["pixel_values"]
    # Command-line arguments are used as the candidate text labels.
    text = tokenizer([*sys.argv[1:]], return_tensors="pt", padding=True)
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1).tolist()
    print(probs)
    print(f"cost {int(time.time() - begin)} seconds")
  • case 1:
    model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True, device_map='cpu', attn_implementation=None)
    cost: 3 seconds

  • case 2:
    model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True)
    cost: about 5 minutes (319 seconds)

Case 1 and case 2 use the same image and command-line arguments.
Command: python3 test.py key1 key2 key3

izhaohui changed discussion status to open

Here is the link to a Colab notebook with an example of how to use the ONNX version of the model. This version has a much more reasonable execution speed.
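
For reference, running an exported model with onnxruntime looks roughly like the sketch below. The file name and the input/output names follow the export sketch earlier in this thread; they are assumptions and may differ from what the Colab actually uses.

import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor

name = "visheratin/mexma-siglip2"
tokenizer = AutoTokenizer.from_pretrained(name)
processor = AutoImageProcessor.from_pretrained(name)

# CPU inference session for the exported graph (file name assumed from the export sketch).
session = ort.InferenceSession("mexma-siglip2.onnx", providers=["CPUExecutionProvider"])

img = Image.open("/mnt/xxx.jpg")
pixels = processor(images=img, return_tensors="np")["pixel_values"]
text = tokenizer(["key1", "key2", "key3"], return_tensors="np", padding=True)

# Input/output names match the export sketch; adjust them to the actual exported model.
image_logits, text_logits = session.run(
    ["image_logits", "text_logits"],
    {
        "input_ids": text["input_ids"],
        "attention_mask": text["attention_mask"],
        "pixel_values": pixels.astype(np.float32),
    },
)
print(image_logits)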

Thank you, I will try the ONNX model approach.

visheratin changed discussion status to closed
