Summary
This paper explores the scaling properties of Vision Transformers (ViT), motivated by the observation that scale is key to excellent results in computer vision. The study scales both ViT models and training data and refines the architecture and training recipe, culminating in a ViT model with two billion parameters that achieves state-of-the-art performance on ImageNet (90.45% top-1 accuracy). The paper presents detailed results on scaling trends, sample efficiency, and the impact of dataset size on model performance. It also describes improvements to the ViT model and training procedure, including decoupled weight decay strengths for the head and the body of the model, and memory savings from removing the [class] token.
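To make the decoupled weight-decay idea concrete, the sketch below shows one way to express it: the classification head and the rest of the network ("body") are placed in separate optimizer parameter groups, each with its own decay strength. This is only an illustrative sketch, not the paper's implementation; the `head.` name prefix, the learning rate, and the decay values are placeholder assumptions.

```python
import torch
import torch.nn as nn


def build_optimizer(model: nn.Module,
                    lr: float = 1e-3,
                    head_weight_decay: float = 10.0,
                    body_weight_decay: float = 0.03) -> torch.optim.Optimizer:
    """Apply a different (decoupled) weight decay to the head and the body.

    Assumes the classification head's parameters are named with a "head."
    prefix; the decay values are placeholders, not the paper's settings.
    """
    head_params, body_params = [], []
    for name, param in model.named_parameters():
        if name.startswith("head."):
            head_params.append(param)
        else:
            body_params.append(param)
    # AdamW applies weight decay in decoupled form; each parameter group
    # gets its own decay strength.
    return torch.optim.AdamW(
        [
            {"params": head_params, "weight_decay": head_weight_decay},
            {"params": body_params, "weight_decay": body_weight_decay},
        ],
        lr=lr,
    )
```

In this setup, a stronger decay on the head paired with a weaker decay on the body can be tuned independently, which is the separation the paper's "decoupled weight decay" refers to.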