FDViT: Improve the Hierarchical Architecture of Vision Transformer (ICCV 2023)
Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, Ashish Sirasao | Paper
Advanced Micro Devices, Inc.
Dependencies
torch == 1.13.1
torchvision == 0.14.1
timm == 0.6.12
einops == 0.6.1
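To confirm that an environment matches these pins before loading the model, a small check along the following lines may help. This is a minimal sketch using only the standard library; the `PINNED` dictionary and the `check_pins` helper are illustrative names, not part of the FDViT release, and the comparison deliberately ignores local build tags such as `+cu117`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINNED = {
    "torch": "1.13.1",
    "torchvision": "0.14.1",
    "timm": "0.6.12",
    "einops": "0.6.1",
}


def check_pins(pins):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Strip a local build tag (e.g. "1.13.1+cu117") before comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_pins(PINNED).items():
    print(f"{name}: installed={installed}, pinned={PINNED[name]}, ok={ok}")
```

Other versions may also work; the check only reports whether the installed versions are the ones listed here.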
Model Performance
The table below reports image classification results for the FDViT models on the ImageNet dataset.
Model | Parameters (M) | FLOPs (G) | Top-1 Accuracy (%) |
---|---|---|---|
FDViT-Ti | 4.6 | 0.6 | 73.74 |
FDViT-S | 21.6 | 2.8 | 81.45 |
FDViT-B | 68.1 | 11.9 | 82.39 |
Model Usage
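import torch
from transformers import AutoModelForImageClassification

# trust_remote_code=True runs the model definition shipped with the checkpoint.
model = AutoModelForImageClassification.from_pretrained("FDViT_b", trust_remote_code=True)
model.eval()  # inference mode: disables dropout and other training-only behavior

# Dummy batch: one 3-channel 224x224 image.
inp = torch.ones(1, 3, 224, 224)
with torch.no_grad():  # gradients are not needed for inference
    out = model(inp)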
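The value of `out` can then be turned into class predictions. The sketch below is one possible way to do this, not the authors' evaluation code. It assumes that the model returns either a raw logits tensor or an object with a `.logits` attribute (the exact return type is not stated here), and that there are 1000 output classes, as in ImageNet-1K. The `top_k_predictions` helper is a hypothetical name, and random logits stand in for `model(inp)` so the snippet runs without downloading weights.

```python
import torch


def top_k_predictions(output, k=5):
    """Return (probabilities, class indices) of the k most likely classes per image."""
    # Assumption: `output` is a logits tensor or exposes a `.logits` attribute.
    logits = getattr(output, "logits", output)
    probs = torch.softmax(logits, dim=-1)  # convert logits to class probabilities
    return torch.topk(probs, k=k, dim=-1)  # (values, indices), sorted by probability


# Stand-in for `model(inp)`: random logits for 1 image over an assumed 1000 classes.
dummy_output = torch.randn(1, 1000)
top_probs, top_ids = top_k_predictions(dummy_output, k=5)
print(top_ids.shape)  # one row of 5 class indices per image
```

In practice, `dummy_output` would be replaced by `out`, and the class indices mapped to label names with whatever label file accompanies the checkpoint.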
Citation
@inproceedings{xu2023fdvit,
title={FDViT: Improve the Hierarchical Architecture of Vision Transformer},
author={Xu, Yixing and Li, Chao and Li, Dong and Sheng, Xiao and Jiang, Fan and Tian, Lu and Sirasao, Ashish},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={5950--5960},
year={2023}
}