ViViT (Video Vision Transformer)

The ViViT model, as introduced in the paper ViViT: A Video Vision Transformer by Arnab et al. and first released in this repository.

Disclaimer: The team releasing ViViT did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description

ViViT is an extension of the Vision Transformer (ViT) to video: the input clip is split into spatio-temporal "tubelet" tokens, which are then processed by a transformer encoder.

We refer to the paper for details.
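
To make the tokenisation concrete, below is a minimal PyTorch sketch of the tubelet embedding, assuming the "16x2" in this checkpoint's name means each token covers a 16x16 spatial patch spanning 2 consecutive frames; the layer and shapes are illustrative, not the library's internal implementation:

```python
import torch
import torch.nn as nn

# Sketch of a tubelet embedding: a 3D convolution whose kernel and stride
# both equal the tubelet size, so each output position is one token.
tubelet_embed = nn.Conv3d(
    in_channels=3,            # RGB
    out_channels=768,         # ViT-Base hidden size
    kernel_size=(2, 16, 16),  # (frames, height, width) per tubelet
    stride=(2, 16, 16),
)

video = torch.randn(1, 3, 32, 224, 224)  # (batch, channels, frames, height, width)
tokens = tubelet_embed(video).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 3136, 768]): 16 * 14 * 14 spatio-temporal tokens
```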

Intended uses & limitations

The model is mostly intended to be fine-tuned on a downstream task, such as video classification, as sketched below. See the model hub to look for fine-tuned versions on a task that interests you.
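
A minimal fine-tuning setup might look like the following; `num_labels=10` is a hypothetical downstream label count, and `ignore_mismatched_sizes=True` discards the 400-class Kinetics head in favour of a freshly initialised one:

```python
from transformers import VivitForVideoClassification

# Load the pretrained backbone and replace the classification head
# with one sized for the (hypothetical) 10-class downstream task.
model = VivitForVideoClassification.from_pretrained(
    "google/vivit-b-16x2-kinetics400",
    num_labels=10,
    ignore_mismatched_sizes=True,
)
```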

How to use

For code examples, we refer to the documentation.
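
As a minimal sketch, the snippet below classifies a clip into one of the 400 Kinetics-400 classes using the `VivitImageProcessor` and `VivitForVideoClassification` classes from transformers; random frames stand in for a real decoded video (e.g. one read with PyAV or decord):

```python
import numpy as np
import torch
from transformers import VivitImageProcessor, VivitForVideoClassification

processor = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")
model = VivitForVideoClassification.from_pretrained("google/vivit-b-16x2-kinetics400")

# This checkpoint expects 32 frames; here we fake them with random pixels.
video = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The head predicts one of the 400 Kinetics-400 classes.
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```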

BibTeX entry and citation info

@misc{arnab2021vivit,
      title={ViViT: A Video Vision Transformer}, 
      author={Anurag Arnab and Mostafa Dehghani and Georg Heigold and Chen Sun and Mario Lučić and Cordelia Schmid},
      year={2021},
      eprint={2103.15691},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}