Exploring Plain Vision Transformer Backbones for Object Detection
Abstract
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
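To make the "simple feature pyramid" idea from the abstract concrete, below is a minimal PyTorch sketch. It assumes the plain ViT backbone outputs a single stride-16 feature map and produces stride {4, 8, 16, 32} levels from it, using deconvolutions to upsample and max pooling to downsample as described in the paper. The class name, channel widths, and activation choices here are illustrative assumptions, not the Detectron2 implementation of ViTDet.

```python
import torch
import torch.nn as nn


class SimpleFeaturePyramid(nn.Module):
    """Builds {1/4, 1/8, 1/16, 1/32} feature maps from the single 1/16-scale
    output of a plain ViT backbone (sketch; widths are illustrative)."""

    def __init__(self, embed_dim: int = 768, out_channels: int = 256):
        super().__init__()
        # 1/16 -> 1/4: two stride-2 deconvolutions
        self.up4 = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, embed_dim // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(embed_dim // 2, embed_dim // 4, 2, stride=2),
        )
        # 1/16 -> 1/8: one stride-2 deconvolution
        self.up8 = nn.ConvTranspose2d(embed_dim, embed_dim // 2, 2, stride=2)
        # 1/16 -> 1/16: identity
        self.keep16 = nn.Identity()
        # 1/16 -> 1/32: stride-2 max pooling
        self.down32 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Per-level projections to a common output width for the detector head
        dims = [embed_dim // 4, embed_dim // 2, embed_dim, embed_dim]
        self.out_convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(d, out_channels, 1),
                nn.Conv2d(out_channels, out_channels, 3, padding=1),
            )
            for d in dims
        )

    def forward(self, x: torch.Tensor):
        # x: (B, embed_dim, H/16, W/16) -- the single-scale ViT feature map
        feats = [self.up4(x), self.up8(x), self.keep16(x), self.down32(x)]
        return [conv(f) for conv, f in zip(self.out_convs, feats)]


# Usage sketch: a 1024x1024 image gives a 64x64 stride-16 map from ViT-B.
fpn = SimpleFeaturePyramid(embed_dim=768)
p4, p8, p16, p32 = fpn(torch.randn(1, 768, 64, 64))
```

The key design point the abstract highlights is that no top-down/lateral FPN fusion is needed: each pyramid level is derived independently from the same single-scale map.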