How to Optimize Slow CPU Inference Speed

#2
by izhaohui - opened

Due to certain restrictions, I cannot use a GPU during deployment or access GPU-backed APIs on other devices. When computing embeddings on the CPU, processing an 800x800 image takes several minutes, which is extremely slow. Other CLIP models I’ve tested typically complete this in around 1 second. I’m not familiar with the model’s architecture, but I’d like to know if there are ways to optimize CPU inference speed. Thanks.

After setting the attention implementation (flash attention) to None, the problem was resolved. Thanks.

izhaohui changed discussion status to closed

It feels like there is something wrong with your setup. The demo runs on CPU and completes the task in ~30-40 seconds, which is kinda expected. This model supports higher resolution - 512px compared to 256px in standard CLIP models - which results in a 4x longer input sequence for the visual branch. Performance may also be affected by long text inputs. Can you share the code and some data examples?
Generally, one good way to optimize CPU inference is to use ONNX. The conversion from PyTorch to ONNX is quite straightforward, but it can be tricky the first time you do it. I'll convert the model to ONNX over the weekend and update the repo.
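
In the meantime, here is a rough sketch of what the export could look like. It wraps the model's get_logits entry point (the one shown in the repo's usage example) in a plain forward() so it can be traced; the wrapper class, dummy input shapes, output file name, and opset version below are assumptions and may need adjusting for this model.

import torch
from transformers import AutoModel

# Hypothetical export sketch -- wrapper, dummy shapes, and opset are assumptions.
class LogitsWrapper(torch.nn.Module):
    """Expose get_logits() as a plain forward() so torch.onnx.export can trace it."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, pixel_values):
        return self.model.get_logits(input_ids, attention_mask, pixel_values)

model = AutoModel.from_pretrained(
    "visheratin/mexma-siglip2", trust_remote_code=True, attn_implementation=None
)
model.eval()

# Dummy inputs for tracing; 512x512 matches the model's stated input resolution.
dummy_ids = torch.ones(1, 16, dtype=torch.long)
dummy_mask = torch.ones(1, 16, dtype=torch.long)
dummy_pixels = torch.randn(1, 3, 512, 512)

torch.onnx.export(
    LogitsWrapper(model),
    (dummy_ids, dummy_mask, dummy_pixels),
    "mexma-siglip2.onnx",
    input_names=["input_ids", "attention_mask", "pixel_values"],
    output_names=["image_logits", "text_logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "pixel_values": {0: "batch"},
    },
    opset_version=17,
)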

CPU: Intel 10500
Memory: 32GB
Runtime: Docker
OS: Debian

from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import torch
import sys
import time

name = "visheratin/mexma-siglip2"
path = "/xxxx/models"  # local copy of the model
name = path
# Loading with attn_implementation=None is the fast configuration (case 1 below).
model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True, device_map='cpu', attn_implementation=None)
tokenizer = AutoTokenizer.from_pretrained(name)
processor = AutoImageProcessor.from_pretrained(name)

begin = time.time()
with torch.inference_mode():
    # Load and downscale the test image, then preprocess it for the vision branch.
    img = Image.open("/mnt/xxx.jpg")
    img.thumbnail((800, 800))
    img = processor(images=img, return_tensors="pt")["pixel_values"]
    # Command-line arguments are used as the candidate text labels.
    text = tokenizer([*sys.argv[1:]], return_tensors="pt", padding=True)
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1).tolist()
    print(probs)
    print(f"cost {int(time.time() - begin)} seconds")
  • case 1:
    model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True, device_map='cpu', attn_implementation=None)
    cost: 3 seconds

  • case 2:
    model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True)
    cost: about 5 minutes (319 seconds)

Case 1 and case 2 use the same image and command-line arguments.
Command: python3 test.py key1 key2 key3

izhaohui changed discussion status to open

Here is the link to a Colab notebook with an example of how to use the ONNX version of the model. This version has a much more reasonable execution speed.
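
For reference, running an exported model with onnxruntime looks roughly like the sketch below. The file name and the input/output names follow the export sketch earlier in this thread; they are assumptions and may differ from what the Colab actually uses.

import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor

name = "visheratin/mexma-siglip2"
tokenizer = AutoTokenizer.from_pretrained(name)
processor = AutoImageProcessor.from_pretrained(name)

# CPU inference session for the exported graph (file name assumed from the export sketch).
session = ort.InferenceSession("mexma-siglip2.onnx", providers=["CPUExecutionProvider"])

img = Image.open("/mnt/xxx.jpg")
pixels = processor(images=img, return_tensors="np")["pixel_values"]
text = tokenizer(["key1", "key2", "key3"], return_tensors="np", padding=True)

# Input/output names match the export sketch; adjust them to the actual exported model.
image_logits, text_logits = session.run(
    ["image_logits", "text_logits"],
    {
        "input_ids": text["input_ids"],
        "attention_mask": text["attention_mask"],
        "pixel_values": pixels.astype(np.float32),
    },
)
print(image_logits)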

Thank you, I will try the ONNX model approach.

visheratin changed discussion status to closed
