Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
Abstract
Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
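The core idea described in the abstract, re-processing only those tokens that have changed significantly between frames, can be illustrated with a minimal PyTorch sketch. This is a hypothetical illustration rather than the code from the repository linked below: the class name `TokenGate`, the `threshold` parameter, and the per-token update rule are assumptions made for exposition only.

```python
# Hypothetical sketch of threshold-based token gating (not the repository's API).
import torch


class TokenGate(torch.nn.Module):
    """Caches per-token results and re-computes only the tokens that changed
    noticeably since the last processed frame (inference-only sketch)."""

    def __init__(self, threshold: float = 0.5):
        super().__init__()
        self.threshold = threshold
        self.reference = None   # input tokens from the last update
        self.cached_out = None  # corresponding cached outputs

    @torch.no_grad()
    def forward(self, tokens: torch.Tensor, block: torch.nn.Module) -> torch.Tensor:
        # tokens: (num_tokens, dim) for the current frame
        if self.reference is None:
            # First frame: process every token and cache the results.
            self.reference = tokens.clone()
            self.cached_out = block(tokens)
            return self.cached_out

        # Find tokens whose change since the reference exceeds the threshold.
        change = (tokens - self.reference).norm(dim=-1)
        updated = change > self.threshold
        if updated.any():
            # Re-process only the changed tokens; reuse cached outputs for the rest.
            self.reference[updated] = tokens[updated]
            self.cached_out[updated] = block(tokens[updated])
        return self.cached_out


# Example usage with a stand-in token-wise sub-block:
# gate = TokenGate(threshold=0.5)
# mlp = torch.nn.Linear(192, 192)
# for frame_tokens in video_tokens:  # iterable of (num_tokens, 192) tensors
#     out = gate(frame_tokens, mlp)
```

In this sketch, raising or lowering `threshold` trades compute for accuracy at runtime, which mirrors the adaptive control over compute cost mentioned in the abstract.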
Community
What kind of resources (GPUs, etc.) are needed for minimal training for learning purposes? Can I see some instructions?
Code here: https://github.com/WISION-Lab/eventful-transformer/
For the most part, our method doesn't require any re-training. You can generally just use pre-trained weights (links on GitHub).
Fine-tuning the temporal component in Section 5.2 took <2 days on one 3090 (if I remember correctly).