clip-ViT-B-32

This is the Image & Text model CLIP, which maps text and images to a shared vector space. For applications of the models, have a look in our documentation SBERT.net - Image Search

Usage

After installing sentence-transformers (pip install sentence-transformers), the usage of this model is easy:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

#Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')

#Encode an image:
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

#Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

#Compute cosine similarities 
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)

See our SBERT.net - Image Search documentation for more examples how the model can be used for image search, zero-shot image classification, image clustering and image deduplication.

Performance

In the following table we find the zero-shot ImageNet validation set accuracy:

Model Top 1 Performance
clip-ViT-B-32 63.3
clip-ViT-B-16 68.1
clip-ViT-L-14 75.4

For a multilingual version of the CLIP model for 50+ languages have a look at: clip-ViT-B-32-multilingual-v1

Downloads last month
21
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.