hezarai/vit-roberta-fa-image-captioning-flickr30k


A Persian image captioning model built on a ViT + RoBERTa encoder-decoder architecture and trained on flickr30k-fa (created by Sajjad Ayoubi). The encoder (ViT) was initialized from https://huggingface.co/google/vit-base-patch16-224 and the decoder (RoBERTa) from https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base.

Usage

pip install hezar

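from hezar.models import Model

# Load the pretrained captioning model from the Hugging Face Hub
model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
# Generate a Persian caption for a local image file
captions = model.predict("example_image.jpg")
print(captions)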
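To caption several images in one go, the predict call above can be wrapped in a small helper. The following is a minimal sketch, not part of the hezar library: `caption_images`, `extract_text`, and `fake_predict` are hypothetical names, and the code assumes each prediction item is either a plain string or a dict with a "text" key, which may not match the actual output of `model.predict`. A stub stands in for the model so the sketch runs without downloading any weights.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def extract_text(prediction) -> str:
    """Pull the caption text out of one prediction item.

    Assumption: an item is either a plain string or a dict with a "text" key.
    """
    if isinstance(prediction, dict):
        return str(prediction.get("text", ""))
    return str(prediction)


def caption_images(predict: Callable, image_paths: Iterable) -> Dict[str, List[str]]:
    """Call `predict` once per image and map each path to its caption texts."""
    captions: Dict[str, List[str]] = {}
    for path in image_paths:
        outputs = predict(str(path))
        # Normalize a single result into a one-element list
        if isinstance(outputs, (str, dict)):
            outputs = [outputs]
        captions[str(path)] = [extract_text(item) for item in outputs]
    return captions


def fake_predict(path: str):
    """Stub predictor used only for illustration (hypothetical output shape)."""
    return [{"text": f"caption for {Path(path).name}"}]


result = caption_images(fake_predict, ["a.jpg", "dir/b.jpg"])
print(result)
```

With the real model loaded, passing `model.predict` in place of `fake_predict` should work as long as the output shape matches the assumption stated above.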

