---
library_name: tf-keras
license: apache-2.0
title: Video Vision Transformer on medmnist
emoji: 🧑‍⚕️
colorFrom: red
colorTo: green
sdk: gradio
app_file: app.py
pinned: false
---

## Keras Implementation of Video Vision Transformer on medmnist

This repo contains the model for [this Keras example on Video Vision Transformer](https://keras.io/examples/vision/vivit/).

## Background Information

This example implements [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Arnab et al., a pure Transformer-based model for video classification. The authors propose a novel tubelet embedding scheme and a number of Transformer variants to model video clips; a sketch of the embedding appears in the Tubelet Embedding section below.

## Datasets

We use the [MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification](https://medmnist.com/) dataset, specifically the OrganMNIST3D subset of 28×28×28 volumes with 11 classes (see the parameters below).

## Training Parameters

```
import tensorflow as tf

# DATA
DATASET_NAME = "organmnist3d"
BATCH_SIZE = 32
AUTO = tf.data.AUTOTUNE
INPUT_SHAPE = (28, 28, 28, 1)
NUM_CLASSES = 11

# OPTIMIZER
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 1e-5

# TRAINING
EPOCHS = 80

# TUBELET EMBEDDING
PATCH_SIZE = (8, 8, 8)
# A (28, 28, 28) volume split into non-overlapping (8, 8, 8) tubelets
# gives a 3 x 3 x 3 grid, so the patch count is cubed rather than squared.
NUM_PATCHES = (INPUT_SHAPE[0] // PATCH_SIZE[0]) ** 3

# ViViT ARCHITECTURE
LAYER_NORM_EPS = 1e-6
PROJECTION_DIM = 128
NUM_HEADS = 8
NUM_LAYERS = 8
```
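## Tubelet Embedding

In the linked Keras example, the tubelet embedding is implemented as a single `Conv3D` whose kernel size and strides both equal `PATCH_SIZE`, so each non-overlapping 8×8×8 tubelet is projected to one `PROJECTION_DIM`-dimensional token. The minimal sketch below follows that approach and assumes the constants defined above; it is an illustration, not the exact training code in this repo.

```
from tensorflow.keras import layers


class TubeletEmbedding(layers.Layer):
    """Map a video volume to a sequence of tubelet tokens."""

    def __init__(self, embed_dim, patch_size, **kwargs):
        super().__init__(**kwargs)
        # Kernel size equals the stride, so the 3D tubelets do not overlap.
        self.projection = layers.Conv3D(
            filters=embed_dim,
            kernel_size=patch_size,
            strides=patch_size,
            padding="VALID",
        )
        # Flatten the 3D grid of projected patches into a token sequence.
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        projected_patches = self.projection(videos)
        return self.flatten(projected_patches)
```

With `INPUT_SHAPE = (28, 28, 28, 1)` and `PATCH_SIZE = (8, 8, 8)`, the convolution produces a 3×3×3 grid of patches, which the reshape flattens into 27 tokens of width `embed_dim`.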
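## Training Setup

For completeness, here is a minimal compile-and-fit sketch using the hyperparameters above. The names `model`, `trainloader`, and `validloader` are assumed placeholders (a ViViT classifier built as in the linked example, plus batched `tf.data` pipelines over OrganMNIST3D); `AdamW` is available in `tf.keras.optimizers` in recent TensorFlow releases, while older versions provide it via TensorFlow Addons.

```
import tensorflow as tf

# Assumptions: `model` is a ViViT classifier built as in the linked example,
# and `trainloader` / `validloader` are batched tf.data pipelines over
# OrganMNIST3D (hypothetical names, not taken from this repo).
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=LEARNING_RATE, weight_decay=WEIGHT_DECAY
)
model.compile(
    optimizer=optimizer,
    loss="sparse_categorical_crossentropy",  # integer class labels
    metrics=["accuracy"],
)
model.fit(trainloader, epochs=EPOCHS, validation_data=validloader)
```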