Vision Transformer: ViT and its Derivatives

By Zujun Fu et al.
Published on May 25, 2022

Table of Contents

1 Pyramid Vision Transformer
2 Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
3 Scaling Vision Transformer
4 Replacing self-attention: independent token + channel mixing methods
4.1 MLP-Mixer
4.2 XCiT: Cross-Covariance Image Transformers
4.3 ConvMixer
5 Multiscale Vision Transformers
6 Video classification: Timesformer
7 ViT in semantic segmentation: SegFormer
8 Vision Transformers in Medical imaging: Unet + ViT = UNETR

Summary

The Transformer, an attention-based encoder-decoder architecture, has revolutionized natural language processing (NLP) and is now reshaping computer vision (CV). The Vision Transformer (ViT) achieves strong performance on benchmarks such as ImageNet, COCO, and ADE20K by replacing word embeddings with patch embeddings. This article reviews derivatives of ViT and their applications, including the Pyramid Vision Transformer, the Swin Transformer, and Scaling Vision Transformers. It also covers methods that replace self-attention with independent token and channel mixing (MLP-Mixer, XCiT, ConvMixer), Multiscale Vision Transformers, SegFormer for semantic segmentation, and UNETR, which combines a U-Net-style architecture with ViT for 3D medical image segmentation.
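The patch-embedding step mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the article's code: the image is cut into non-overlapping patches, each patch is flattened, and a linear projection maps it to a token vector. The random projection matrix here merely stands in for ViT's learned weights.

```python
import numpy as np

def patch_embed(image, patch_size, embed_dim, rng=None):
    """Split an image (H, W, C) into non-overlapping patches and
    linearly project each flattened patch to an embedding vector,
    as in ViT's patch embedding."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (num_patches, P*P*C)
    patches = (image.reshape(H // P, P, W // P, P, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, P * P * C))
    # Random matrix stands in for the learned linear projection
    W_proj = rng.standard_normal((P * P * C, embed_dim)) * 0.02
    return patches @ W_proj

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 patch tokens
img = np.zeros((224, 224, 3))
tokens = patch_embed(img, patch_size=16, embed_dim=768)
print(tokens.shape)  # (196, 768)
```

These 196 tokens play the role that word embeddings play in NLP; ViT then prepends a class token and adds positional embeddings before feeding the sequence to a standard Transformer encoder.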