{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Video Classification with a CNN-RNN Architecture" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Original Author:** Sayak Paul \n", "**Date created:** 2021/05/28 \n", "**Last modified:** 2021/06/05 \n", "**Description:** Training a video classifier with transfer learning and a recurrent model on the UCF101 dataset. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example demonstrates video classification, an important use-case with applications in recommendations, security, and so on. We will be using the UCF101 dataset to build our video classifier. The dataset consists of videos categorized into different actions, like cricket shot, punching, biking, etc. This dataset is commonly used to build action recognizers, which are an application of video classification.\n", "\n", "A video consists of an ordered sequence of frames. Each frame contains spatial information, and the sequence of those frames contains temporal information. To model both of these aspects, we use a hybrid architecture that consists of convolutions (for spatial processing) as well as recurrent layers (for temporal processing). Specifically, we'll use a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) consisting of GRU layers. This kind of hybrid architecture is popularly known as a CNN-RNN." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from tensorflow_docs.vis import embed\n", "from tensorflow import keras\n", "from imutils import paths\n", "\n", "import matplotlib.pyplot as plt\n", "import tensorflow as tf\n", "import pandas as pd\n", "import numpy as np\n", "import imageio\n", "import cv2\n", "import os" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "IMG_SIZE = 224\n", "BATCH_SIZE = 64\n", "EPOCHS = 12\n", "\n", "MAX_SEQ_LENGTH = 20\n", "NUM_FEATURES = 2048" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Collection\n", "\n", "To keep the runtime of this example relatively short, we will use a subsampled version of the original UCF101 dataset. Refer to this notebook to see how the subsampling was done." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "!wget -q https://git.io/JGc31 -O ucf101_top5.tar.gz\n", "!tar xf ucf101_top5.tar.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Preparation\n", "\n", "*P.S. If you want to skip the data preparation step and speed up training, I have already run it and saved the results to `.npy` files.*" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total videos for training: 594\n", "Total videos for testing: 224\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " video_name tag\n", "302 v_Punch_g17_c04.avi Punch\n", "227 v_PlayingCello_g24_c04.avi PlayingCello\n", "335 v_Punch_g22_c05.avi Punch\n", "115 v_CricketShot_g25_c05.avi CricketShot\n", "64 v_CricketShot_g17_c02.avi CricketShot\n", "74 v_CricketShot_g19_c01.avi CricketShot\n", "464 v_ShavingBeard_g24_c01.avi ShavingBeard\n", "565 v_TennisSwing_g21_c02.avi TennisSwing\n", "313 v_Punch_g19_c01.avi Punch\n", "557 v_TennisSwing_g19_c06.avi TennisSwing" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df = pd.read_csv(\"train.csv\")\n", "test_df = pd.read_csv(\"test.csv\")\n", "\n", "print(f\"Total videos for training: {len(train_df)}\")\n", "print(f\"Total videos for testing: {len(test_df)}\")\n", "\n", "train_df.sample(10)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# The following two methods are taken from this tutorial:\n", "# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub\n", "\n", "\n", "def crop_center_square(frame):\n", " y, x = frame.shape[0:2]\n", " min_dim = min(y, x)\n", " start_x = (x // 2) - (min_dim // 2)\n", " start_y = (y // 2) - (min_dim // 2)\n", " return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]\n", "\n", "\n", "def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):\n", " cap = cv2.VideoCapture(path)\n", " frames = []\n", " try:\n", " while True:\n", " ret, frame = cap.read()\n", " if not ret:\n", " break\n", " frame = crop_center_square(frame)\n", " frame = cv2.resize(frame, resize)\n", " frame = frame[:, :, [2, 1, 0]]\n", " frames.append(frame)\n", "\n", " if len(frames) == max_frames:\n", " break\n", " finally:\n", " cap.release()\n", " return np.array(frames)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def build_feature_extractor():\n", " feature_extractor = keras.applications.InceptionV3(\n", " 
weights=\"imagenet\",\n", " include_top=False,\n", " pooling=\"avg\",\n", " input_shape=(IMG_SIZE, IMG_SIZE, 3),\n", " )\n", " preprocess_input = keras.applications.inception_v3.preprocess_input\n", "\n", " inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))\n", " preprocessed = preprocess_input(inputs)\n", "\n", " outputs = feature_extractor(preprocessed)\n", " return keras.Model(inputs, outputs, name=\"feature_extractor\")\n", "\n", "\n", "feature_extractor = build_feature_extractor()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['CricketShot', 'PlayingCello', 'Punch', 'ShavingBeard', 'TennisSwing']\n" ] } ], "source": [ "label_processor = keras.layers.StringLookup(\n", " num_oov_indices=0, vocabulary=np.unique(train_df[\"tag\"])\n", ")\n", "print(label_processor.get_vocabulary())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Frame features in train set: (594, 20, 2048)\n", "Frame masks in train set: (594, 20)\n" ] } ], "source": [ "def prepare_all_videos(df, root_dir):\n", " num_samples = len(df)\n", " video_paths = df[\"video_name\"].values.tolist()\n", " labels = df[\"tag\"].values\n", " labels = label_processor(labels[..., None]).numpy()\n", "\n", " # `frame_masks` and `frame_features` are what we will feed to our sequence model.\n", " # `frame_masks` will contain a bunch of booleans denoting if a timestep is\n", " # masked with padding or not.\n", " frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype=\"bool\")\n", " frame_features = np.zeros(\n", " shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype=\"float32\"\n", " )\n", "\n", " # For each video.\n", " for idx, path in enumerate(video_paths):\n", " # Gather all its frames and add a batch dimension.\n", " frames = load_video(os.path.join(root_dir, path))\n", " frames = frames[None, ...]\n", "\n", " # Initialize 
placeholders to store the masks and features of the current video.\n", " temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype=\"bool\")\n", " temp_frame_features = np.zeros(\n", " shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype=\"float32\"\n", " )\n", "\n", " # Extract features from the frames of the current video.\n", " for i, batch in enumerate(frames):\n", " video_length = batch.shape[0]\n", " length = min(MAX_SEQ_LENGTH, video_length)\n", " for j in range(length):\n", " temp_frame_features[i, j, :] = feature_extractor.predict(\n", " batch[None, j, :]\n", " )\n", " temp_frame_mask[i, :length] = 1 # 1 = not masked, 0 = masked\n", "\n", " frame_features[idx,] = temp_frame_features.squeeze()\n", " frame_masks[idx,] = temp_frame_mask.squeeze()\n", "\n", " return (frame_features, frame_masks), labels\n", "\n", "\n", "train_data, train_labels = prepare_all_videos(train_df, \"train\")\n", "test_data, test_labels = prepare_all_videos(test_df, \"test\")\n", "\n", "print(f\"Frame features in train set: {train_data[0].shape}\")\n", "print(f\"Frame masks in train set: {train_data[1].shape}\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "from pathlib import Path\n", "\n", "def get_sequence_model():\n", " class_vocab = label_processor.get_vocabulary()\n", "\n", " frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))\n", " mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype=\"bool\")\n", "\n", " # Refer to the following tutorial to understand the significance of using `mask`:\n", " # https://keras.io/api/layers/recurrent_layers/gru/\n", " x = keras.layers.GRU(16, return_sequences=True)(\n", " frame_features_input, mask=mask_input\n", " )\n", " x = keras.layers.GRU(8)(x)\n", " x = keras.layers.Dropout(0.4)(x)\n", " x = keras.layers.Dense(8, activation=\"relu\")(x)\n", " output = keras.layers.Dense(len(class_vocab), activation=\"softmax\")(x)\n", "\n", " rnn_model = 
keras.Model([frame_features_input, mask_input], output)\n", "\n", " rnn_model.compile(\n", " loss=\"sparse_categorical_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"]\n", " )\n", " return rnn_model\n", "\n", "logdir = f\"logs/scalars/{datetime.now().strftime('%Y%m%d-%H%M%S')}\"\n", "tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)\n", "\n", "def run_experiment():\n", " filepath = Path.cwd()\n", " checkpoint = keras.callbacks.ModelCheckpoint(\n", " filepath, save_weights_only=True, save_best_only=True, verbose=1\n", " )\n", "\n", " seq_model = get_sequence_model()\n", " history = seq_model.fit(\n", " [train_data[0], train_data[1]],\n", " train_labels,\n", " batch_size=BATCH_SIZE,\n", " validation_split=0.3,\n", " epochs=EPOCHS,\n", " callbacks=[\n", " checkpoint,\n", " tensorboard_callback,\n", " ],\n", " )\n", "\n", " seq_model.load_weights(filepath)\n", " _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)\n", " print(f\"Test accuracy: {round(accuracy * 100, 2)}%\")\n", "\n", " return history, seq_model" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/12\n", "7/7 [==============================] - 6s 300ms/step - loss: 1.3992 - accuracy: 0.4530 - val_loss: 1.4966 - val_accuracy: 0.3128\n", "\n", "Epoch 00001: val_loss improved from inf to 1.49663, saving model to /home/chainyo/code/video-classification-cnn-rnn\n", "Epoch 2/12\n", "7/7 [==============================] - 0s 16ms/step - loss: 1.0924 - accuracy: 0.7422 - val_loss: 1.4947 - val_accuracy: 0.3352\n", "\n", "Epoch 00002: val_loss improved from 1.49663 to 1.49470, saving model to /home/chainyo/code/video-classification-cnn-rnn\n", "Epoch 3/12\n", "7/7 [==============================] - 0s 16ms/step - loss: 0.9680 - accuracy: 0.8578 - val_loss: 1.4927 - val_accuracy: 0.3352\n", "\n", "Epoch 00003: val_loss improved from 1.49470 to 1.49274, saving model to 
/home/chainyo/code/video-classification-cnn-rnn\n", "Epoch 4/12\n", "7/7 [==============================] - 0s 15ms/step - loss: 0.8960 - accuracy: 0.8843 - val_loss: 1.4677 - val_accuracy: 0.3520\n", "\n", "Epoch 00004: val_loss improved from 1.49274 to 1.46771, saving model to /home/chainyo/code/video-classification-cnn-rnn\n", "Epoch 5/12\n", "7/7 [==============================] - 0s 15ms/step - loss: 0.8277 - accuracy: 0.9349 - val_loss: 1.4926 - val_accuracy: 0.3352\n", "\n", "Epoch 00005: val_loss did not improve from 1.46771\n", "Epoch 6/12\n", "7/7 [==============================] - 0s 15ms/step - loss: 0.7922 - accuracy: 0.9253 - val_loss: 1.4999 - val_accuracy: 0.3352\n", "\n", "Epoch 00006: val_loss did not improve from 1.46771\n", "Epoch 7/12\n", "7/7 [==============================] - 0s 16ms/step - loss: 0.7265 - accuracy: 0.9205 - val_loss: 1.4578 - val_accuracy: 0.3464\n", "\n", "Epoch 00007: val_loss improved from 1.46771 to 1.45775, saving model to /home/chainyo/code/video-classification-cnn-rnn\n", "Epoch 8/12\n", "7/7 [==============================] - 0s 17ms/step - loss: 0.7045 - accuracy: 0.9542 - val_loss: 1.4818 - val_accuracy: 0.3464\n", "\n", "Epoch 00008: val_loss did not improve from 1.45775\n", "Epoch 9/12\n", "7/7 [==============================] - 0s 16ms/step - loss: 0.6313 - accuracy: 0.9566 - val_loss: 1.5009 - val_accuracy: 0.3408\n", "\n", "Epoch 00009: val_loss did not improve from 1.45775\n", "Epoch 10/12\n", "7/7 [==============================] - 0s 15ms/step - loss: 0.5996 - accuracy: 0.9711 - val_loss: 1.5407 - val_accuracy: 0.3464\n", "\n", "Epoch 00010: val_loss did not improve from 1.45775\n", "Epoch 11/12\n", "7/7 [==============================] - 0s 15ms/step - loss: 0.5448 - accuracy: 0.9614 - val_loss: 1.5320 - val_accuracy: 0.3352\n", "\n", "Epoch 00011: val_loss did not improve from 1.45775\n", "Epoch 12/12\n", "7/7 [==============================] - 0s 15ms/step - loss: 0.5324 - accuracy: 0.9590 - val_loss: 
1.5366 - val_accuracy: 0.3464\n", "\n", "Epoch 00012: val_loss did not improve from 1.45775\n", "7/7 [==============================] - 1s 4ms/step - loss: 0.9195 - accuracy: 0.7768\n", "Test accuracy: 77.68%\n" ] } ], "source": [ "history, sequence_model = run_experiment()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test video path: v_Punch_g03_c02.avi\n", " Punch: 56.50%\n", " TennisSwing: 29.97%\n", " PlayingCello: 6.47%\n", " ShavingBeard: 3.69%\n", " CricketShot: 3.38%\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def prepare_single_video(frames):\n", " frames = frames[None, ...]\n", " frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype=\"bool\")\n", " frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype=\"float32\")\n", "\n", " for i, batch in enumerate(frames):\n", " video_length = batch.shape[0]\n", " length = min(MAX_SEQ_LENGTH, video_length)\n", " for j in range(length):\n", " frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])\n", " frame_mask[i, :length] = 1 # 1 = not masked, 0 = masked\n", "\n", " return frame_features, frame_mask\n", "\n", "\n", "def sequence_prediction(path):\n", " class_vocab = label_processor.get_vocabulary()\n", "\n", " frames = load_video(os.path.join(\"test\", path))\n", " frame_features, frame_mask = prepare_single_video(frames)\n", " probabilities = sequence_model.predict([frame_features, frame_mask])[0]\n", "\n", " for i in np.argsort(probabilities)[::-1]:\n", " print(f\" {class_vocab[i]}: {probabilities[i] * 100:5.2f}%\")\n", " return frames\n", "\n", "\n", "# This utility is for visualization.\n", "# Referenced from:\n", "# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub\n", "def to_gif(images):\n", " converted_images = images.astype(np.uint8)\n", " 
imageio.mimsave(\"animation.gif\", converted_images, fps=10)\n", " return embed.embed_file(\"animation.gif\")\n", "\n", "\n", "test_video = np.random.choice(test_df[\"video_name\"].values.tolist())\n", "print(f\"Test video path: {test_video}\")\n", "test_frames = sequence_prediction(test_video)\n", "to_gif(test_frames[:MAX_SEQ_LENGTH])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Export to Hugging Face Hub" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/chainyo/code/video-classification-cnn-rnn is already a clone of https://huggingface.co./ChainYo/video-classification-cnn-rnn. Make sure you pull the latest changes with `repo.git_pull()`.\n", "WARNING:absl:Found untraced functions such as gru_cell_layer_call_and_return_conditional_losses, gru_cell_layer_call_fn, gru_cell_1_layer_call_and_return_conditional_losses, gru_cell_1_layer_call_fn, gru_cell_layer_call_fn while saving (showing 5 of 10). 
These functions will not be directly callable after loading.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: /home/chainyo/code/video-classification-cnn-rnn/assets\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: /home/chainyo/code/video-classification-cnn-rnn/assets\n", "Several commits (2) will be pushed upstream.\n", "WARNING:huggingface_hub.repository:Several commits (2) will be pushed upstream.\n", "The progress bars may be unreliable.\n", "WARNING:huggingface_hub.repository:The progress bars may be unreliable.\n", "Upload file saved_model.pb: 4%|▍ | 224k/5.60M [00:01<00:28, 196kB/s]\n", "\u001b[ATo https://huggingface.co./ChainYo/video-classification-cnn-rnn\n", " e705e01..ed0a550 main -> main\n", "\n", "WARNING:huggingface_hub.repository:To https://huggingface.co./ChainYo/video-classification-cnn-rnn\n", " e705e01..ed0a550 main -> main\n", "\n", "Upload file saved_model.pb: 100%|██████████| 5.60M/5.60M [00:03<00:00, 1.94MB/s]\n", "Upload file keras_metadata.pb: 100%|██████████| 17.9k/17.9k [00:02