Scaling Vision Transformers

By X. Zhai et al.
Published on June 20, 2022

Table of Contents

1. Introduction
2. Core Results
3. Method Details

Summary

This paper explores the scaling properties of Vision Transformers (ViT), examining how model size, data size, and compute affect performance on computer vision tasks. By scaling both the ViT models and the training data, and by refining the architecture and training recipe, the authors obtain a ViT model with two billion parameters that achieves state-of-the-art accuracy on ImageNet. The paper reports detailed results on scaling trends, sample efficiency, and the effect of dataset size on model performance. It also describes improvements to the ViT architecture and training procedure, including decoupled weight decay strengths for the head and the body, and memory savings from removing the [class] token.
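The decoupled weight decay is a concrete, easily reproduced detail: the classification head and the transformer body receive different decay strengths. Below is a minimal sketch of one way to express this with PyTorch optimizer parameter groups; the stand-in model, layer sizes, and decay values are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from torch import nn

# Minimal stand-in for a ViT: a "body" (encoder) and a "head" (classifier).
# The modules and shapes here are placeholders, not the paper's architecture.
body = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
head = nn.Linear(768, 1000)

# Decoupled weight decay: the head and the body get different decay strengths
# by placing their parameters in separate optimizer parameter groups.
# The specific values below are illustrative, not the paper's settings.
optimizer = torch.optim.AdamW(
    [
        {"params": body.parameters(), "weight_decay": 0.03},
        {"params": head.parameters(), "weight_decay": 3.0},
    ],
    lr=1e-3,
)
```

With this setup, each training step applies the group-specific decay to its own parameters, so the head can be regularized more (or less) aggressively than the body without changing the rest of the training loop.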