---
license: mit
language:
- ar
- kn
- ar
- ka
- af
- kk
- am
- km
- ar
- ky
- ar
- ko
- as
- lo
- az
- ml
- az
- mr
- be
- mk
- bn
- my
- bs
- nl
- bg
- ca
- 'no'
- cs
- ne
- ku
- pl
- cy
- pt
- da
- ro
- de
- ru
- el
- sa
- en
- si
- eo
- sk
- et
- sl
- eu
- sd
- fi
- so
- fr
- es
- gd
- sr
- ga
- su
- gl
- sv
- gu
- sw
- ha
- ta
- he
- te
- hi
- th
- hr
- tr
- hu
- ug
- hy
- uk
- id
- ur
- is
- vi
- it
- xh
- jv
- zh
- ja
pipeline_tag: zero-shot-image-classification
tags:
- siglip2
- clip
- mexma
model-index:
  - name: mexma-siglip2
    results:
      - task:
          type: zero-shot retrieval
        dataset:
          name: Crossmodal-3600
          type: Crossmodal-3600
        metrics:
          - name: Image retrieval R@1
            type: Image retrieval R@1
            value: 62.54%
          - name: Text retrieval R@1
            type: Text retrieval R@1
            value: 59.99%
---

## Model Summary

MEXMA-SigLIP2 combines the [MEXMA](https://huggingface.co./facebook/MEXMA) multilingual text encoder with the image encoder from the
[SigLIP2](https://huggingface.co./google/siglip2-so400m-patch16-512/) model, yielding a high-performance CLIP-style model that covers 80 languages.
MEXMA-SigLIP2 sets a new state of the art on the [Crossmodal-3600](https://google.github.io/crossmodal-3600/) dataset, with 62.54% R@1 for image
retrieval and 59.99% R@1 for text retrieval.
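
R@1 (recall at 1) measures how often the top-ranked candidate is the correct match for a query. For reference, here is a minimal, generic sketch of how the metric is computed from a similarity matrix; it is illustrative only, not the official Crossmodal-3600 evaluation code:

```python
import torch

def recall_at_1(similarity: torch.Tensor) -> float:
    # similarity[i, j] is the score between query i and candidate j;
    # the correct candidate for query i is assumed to sit at index i.
    predictions = similarity.argmax(dim=-1)
    targets = torch.arange(similarity.size(0))
    return (predictions == targets).float().mean().item()

# Toy example: 3 queries, 3 candidates, correct matches on the diagonal.
sims = torch.tensor([[0.9, 0.1, 0.0],
                     [0.2, 0.8, 0.1],
                     [0.3, 0.4, 0.7]])
print(recall_at_1(sims))  # 1.0 -- every query ranks its own candidate first
```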


## How to use

```python
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# The model uses custom code from the repository, so trust_remote_code is required.
model = AutoModel.from_pretrained("visheratin/mexma-siglip2", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip2")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip2")

# Download an image of the Eiffel Tower and preprocess it to pixel values.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")
with torch.inference_mode():
    # Candidate labels in three languages: Russian ("cat"), English, and Hindi ("Eiffel Tower").
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    # Softmax over the candidate texts gives per-label probabilities for the image.
    probs = image_logits.softmax(dim=-1)
    print(probs)
```
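
`get_logits` returns both directions at once: `image_logits` scores the candidate texts for each image, while `text_logits` scores the images for each text. Below is a minimal sketch of caption-to-image retrieval that reuses `model`, `tokenizer`, and `processor` from the snippet above; the orientation of `text_logits` is my reading of the API, so check it against the model code before relying on it:

```python
# Rank candidate images for a caption. Assumes model, tokenizer, and
# processor from the snippet above are already loaded on "cuda".
urls = [
    "https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg",
    # add more candidate image URLs here
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]
pixel_values = processor(images=images, return_tensors="pt")["pixel_values"]
pixel_values = pixel_values.to(torch.bfloat16).to("cuda")
with torch.inference_mode():
    text = tokenizer(["the Eiffel Tower"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], pixel_values)
    # One row per caption; softmax over the images ranks them for that caption.
    print(text_logits.softmax(dim=-1))
```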

## Acknowledgements

I thank [ML Collective](https://mlcollective.org/) for providing compute resources to train the model.