Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
Abstract
Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
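The core idea described in the abstract, re-processing only those tokens that have changed significantly between frames, can be illustrated with a minimal PyTorch sketch. This is a hypothetical illustration rather than the code from the repository linked below: the class name `TokenGate`, the `threshold` parameter, and the per-token update rule are assumptions made for exposition only.

```python
# Hypothetical sketch of threshold-based token gating (not the repository's API).
import torch


class TokenGate(torch.nn.Module):
    """Caches per-token results and re-computes only the tokens that changed
    noticeably since the last processed frame (inference-only sketch)."""

    def __init__(self, threshold: float = 0.5):
        super().__init__()
        self.threshold = threshold
        self.reference = None   # input tokens from the last update
        self.cached_out = None  # corresponding cached outputs

    @torch.no_grad()
    def forward(self, tokens: torch.Tensor, block: torch.nn.Module) -> torch.Tensor:
        # tokens: (num_tokens, dim) for the current frame
        if self.reference is None:
            # First frame: process every token and cache the results.
            self.reference = tokens.clone()
            self.cached_out = block(tokens)
            return self.cached_out

        # Find tokens whose change since the reference exceeds the threshold.
        change = (tokens - self.reference).norm(dim=-1)
        updated = change > self.threshold
        if updated.any():
            # Re-process only the changed tokens; reuse cached outputs for the rest.
            self.reference[updated] = tokens[updated]
            self.cached_out[updated] = block(tokens[updated])
        return self.cached_out


# Example usage with a stand-in token-wise sub-block:
# gate = TokenGate(threshold=0.5)
# mlp = torch.nn.Linear(192, 192)
# for frame_tokens in video_tokens:  # iterable of (num_tokens, 192) tensors
#     out = gate(frame_tokens, mlp)
```

In this sketch, raising or lowering `threshold` trades compute for accuracy at runtime, which mirrors the adaptive control over compute cost mentioned in the abstract.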
Community
What kind of resources (GPUs, etc.) are needed for minimal training for learning purposes? Can I see some instructions?
Code here: https://github.com/WISION-Lab/eventful-transformer/
For the most part, our method doesn't require any re-training. You can generally just use pre-trained weights (links on GitHub).
Fine-tuning the temporal component in Section 5.2 took <2 days on one 3090 (if I remember correctly).