Model Overview
Model Summary
This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from natural-language supervision: it is trained on a large dataset of image-text pairs collected from the web, so that an image and its matching description land close together in a shared embedding space. This makes it well suited to zero-shot image classification and to image search driven by text queries.
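To make the shared embedding space concrete, here is a minimal sketch of zero-shot classification: compare one image embedding against several text embeddings by cosine similarity, then turn the scaled similarities into probabilities with a softmax. The 3-dimensional vectors are made up for illustration, and `logit_scale=100.0` is an assumed temperature rather than a value read from any preset. Real embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per text prompt for a single image.

    logit_scale is an assumed temperature, chosen for illustration only.
    """
    image = l2_normalize(image_embedding)
    logits = []
    for text in text_embeddings:
        text = l2_normalize(text)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Made-up embeddings standing in for encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "mountains": [0.1, 0.9, 0.1],
    "cat on tortoise": [0.8, 0.2, 0.3],
    "house": [0.2, 0.1, 0.9],
}

labels = list(text_embeddings)
probs = zero_shot_probs(image_embedding, list(text_embeddings.values()))
best_label = labels[probs.index(max(probs))]
print(best_label)
```

No classifier head is trained here: changing the list of prompts changes the set of classes, which is what makes the approach zero-shot.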
Weights are released under the MIT License. Keras model code is released under the Apache 2 License.
Installation
Keras and KerasCV can be installed with:
pip install -U -q keras-cv
pip install -U -q "keras>=3"
JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
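Because Keras 3 can run on any of these three backends, it helps to choose one explicitly. Keras reads the `KERAS_BACKEND` environment variable when it is first imported, so the variable has to be set beforehand. The sketch below wraps that in a small helper; `select_backend` and `SUPPORTED_BACKENDS` are hypothetical names introduced for illustration, not part of Keras.

```python
import os

# Backend names accepted by Keras 3.
SUPPORTED_BACKENDS = ("jax", "tensorflow", "torch")


def select_backend(name):
    """Hypothetical helper: set KERAS_BACKEND before Keras is imported."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}")
    os.environ["KERAS_BACKEND"] = name
    return name


selected = select_backend("jax")

# Import Keras only after the variable is set; changing it later has no
# effect on an already imported Keras.
# import keras
# print(keras.config.backend())
```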
Presets
The following model checkpoints are provided by the Keras team. A code example using one of them appears below.
Preset name | Parameters | Description |
---|---|---|
clip-vit-base-patch16 | 149.62M | The model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224). |
clip-vit-base-patch32 | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224). |
clip-vit-large-patch14 | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224). |
clip-vit-large-patch14-336 | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336). |
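The patch size and input resolution in the table determine how many tokens the image encoder processes. As a rough sketch, assuming the usual ViT layout of non-overlapping square patches plus one class token, the sequence length for each preset can be computed as follows. `vit_sequence_length` is a hypothetical helper written for this illustration.

```python
def vit_sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens a ViT sees for a square image.

    Assumes non-overlapping square patches and, optionally, one class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if class_token else 0)


# (image size, patch size) pairs taken from the preset table.
presets = {
    "clip-vit-base-patch16": (224, 16),
    "clip-vit-base-patch32": (224, 32),
    "clip-vit-large-patch14": (224, 14),
    "clip-vit-large-patch14-336": (336, 14),
}

sequence_lengths = {
    name: vit_sequence_length(image_size, patch_size)
    for name, (image_size, patch_size) in presets.items()
}

for name, length in sequence_lengths.items():
    print(f"{name}: {length} tokens")
```

Smaller patches and larger inputs both lengthen the sequence, which is why the 336-pixel preset costs more to run than the 224-pixel preset with nearly the same parameter count.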
Example code
import keras
from keras import ops
from keras_cv.models import CLIP
from keras_cv.models.feature_extractor.clip import CLIPProcessor


# Optional helper: use it if your image still needs preprocessing.
def transform_image(image_path, input_resolution):
    # Per-channel RGB mean and standard deviation used for normalization.
    mean = ops.array([0.48145466, 0.4578275, 0.40821073])
    std = ops.array([0.26862954, 0.26130258, 0.27577711])

    image = keras.utils.load_img(image_path)
    image = keras.utils.img_to_array(image)
    # Resize to a square and scale pixel values by 1/255.
    image = (
        ops.image.resize(
            image,
            (input_resolution, input_resolution),
            interpolation="bicubic",
        )
        / 255.0
    )
    # Central crop. After the square resize above, central_fraction is 1.0,
    # so the crop keeps the whole image.
    central_fraction = input_resolution / image.shape[0]
    height, width = image.shape[0], image.shape[1]
    left = ops.cast((width - width * central_fraction) / 2, dtype="int32")
    top = ops.cast((height - height * central_fraction) / 2, dtype="int32")
    right = ops.cast((width + width * central_fraction) / 2, dtype="int32")
    bottom = ops.cast(
        (height + height * central_fraction) / 2, dtype="int32"
    )
    # Axis order is (rows, columns, channels), so rows use top and bottom.
    image = ops.slice(
        image, [top, left, 0], [bottom - top, right - left, 3]
    )
    image = (image - mean) / std
    # Add a batch dimension: (1, height, width, 3).
    return ops.expand_dims(image, axis=0)


processor = CLIPProcessor("vocab.json", "merges.txt")
processed_image = transform_image("cat.jpg", 224)
tokens = processor(["mountains", "cat on tortoise", "house"])
model = CLIP.from_preset("clip-vit-base-patch32")
output = model(
    {
        "images": processed_image,
        "token_ids": tokens["token_ids"],
        "padding_mask": tokens["padding_mask"],
    }
)