An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

By Alexey Dosovitskiy et al.
Published on June 10, 2021

Table of Contents

1. Abstract
2. Introduction
3. Method
4. Experiments
5. Comparison to State of the Art

Summary

This paper presents the Vision Transformer (ViT) model for image recognition, demonstrating that a pure Transformer applied directly to sequences of image patches can perform well on image classification. ViT attains excellent results when pre-trained on large datasets and transferred to a range of image recognition benchmarks. The architecture follows the standard Transformer, alternating self-attention layers and MLP blocks, and has less image-specific inductive bias than CNNs; the paper gives a detailed overview of the model design and fine-tuning techniques. Experiments show that, with sufficiently large-scale pre-training, ViT matches or exceeds state-of-the-art CNNs on multiple benchmarks while requiring substantially fewer computational resources to pre-train. The paper also discusses dataset scalability, model variants, training procedures, and comparisons to existing models such as BiT and Noisy Student. Overall, ViT demonstrates promising results in image recognition at scale.
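To make the "image as a sequence of patches" idea concrete, here is a minimal sketch of the input pipeline the summary describes: split an image into fixed-size patches, flatten each patch, project it to the model dimension, prepend a class token, and add position embeddings. It assumes the ViT-Base/16 configuration (224x224 images, 16x16 patches, hidden dimension 768); the random weights and function names are illustrative placeholders, not the authors' released implementation.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened patches of shape
    (num_patches, patch_size * patch_size * C)."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)            # (H/P, W/P, P, P, C)
    return patches.reshape(-1, patch_size * patch_size * c)

def embed_patches(patches, hidden_dim=768, seed=0):
    """Linearly project patches, prepend a [class] token, add position embeddings."""
    rng = np.random.default_rng(seed)
    num_patches, patch_dim = patches.shape

    # Placeholder parameters; in ViT these are learned during training.
    projection = rng.normal(scale=0.02, size=(patch_dim, hidden_dim))
    cls_token = rng.normal(scale=0.02, size=(1, hidden_dim))
    pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, hidden_dim))

    tokens = patches @ projection                          # (num_patches, hidden_dim)
    tokens = np.concatenate([cls_token, tokens], axis=0)   # prepend [class] token
    return tokens + pos_embed                               # sequence fed to the Transformer

image = np.random.rand(224, 224, 3)
sequence = embed_patches(patchify(image))
print(sequence.shape)  # (197, 768): 14*14 patches + 1 class token for ViT-Base/16
```

The resulting 197x768 token sequence is what the Transformer encoder (self-attention plus MLP blocks) consumes; the final class-token representation is used for classification.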