Scaling Vision Transformers to 22 Billion Parameters
By Mostafa Dehghani et al.
Table of Contents
Abstract
1 Introduction
2 Model Architecture
3 Training Infrastructure and Efficiency
4 Experiments
4.1 Training details
4.2 Transfer to image classification
4.2.1 Linear probing
4.2.2 Zero-shot via locked-image tuning
4.2.3 Out-of-distribution
4.3 Transfer to dense prediction
4.3.1 Semantic segmentation
4.3.2 Monocular depth estimation
4.4 Transfer to video classification
Summary
The document describes how Vision Transformers are scaled to 22 billion parameters and what this means for image and video modeling. It presents a recipe for training the 22B-parameter ViT-22B model, covering its architecture and training infrastructure, and reports a broad set of experiments. These include training details, transfer to image classification (linear probing, zero-shot transfer via locked-image tuning, and out-of-distribution analysis), transfer to dense prediction tasks such as semantic segmentation and monocular depth estimation, and transfer to video classification. The results demonstrate the effectiveness of ViT-22B across these tasks.
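To make the "linear probing" evaluation mentioned above concrete, the sketch below shows the general idea: a single linear classifier is trained on top of frozen features while the backbone receives no gradient updates. This is a minimal illustrative example, not the paper's implementation; the feature dimension, learning rate, and the random stand-in data are assumptions, and in practice the features would be pooled embeddings from the frozen ViT-22B backbone.

```python
# Minimal sketch of linear probing on frozen features (illustrative only).
# Assumes the frozen backbone's features have already been extracted.
import jax
import jax.numpy as jnp

def init_params(key, feat_dim, num_classes):
    # A single linear layer: the only trainable parameters in a linear probe.
    w = jax.random.normal(key, (feat_dim, num_classes)) * 0.01
    b = jnp.zeros((num_classes,))
    return w, b

def loss_fn(params, feats, labels):
    # Softmax cross-entropy computed on top of the frozen features.
    w, b = params
    logits = feats @ w + b
    log_probs = jax.nn.log_softmax(logits)
    return -jnp.mean(jnp.take_along_axis(log_probs, labels[:, None], axis=1))

@jax.jit
def train_step(params, feats, labels, lr=0.1):
    # Gradients flow only into the probe; the backbone stays frozen.
    grads = jax.grad(loss_fn)(params, feats, labels)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

# Illustrative usage with random stand-ins for frozen features and labels.
key = jax.random.PRNGKey(0)
feats = jax.random.normal(key, (256, 1024))    # e.g. pooled ViT embeddings
labels = jax.random.randint(key, (256,), 0, 10)
params = init_params(key, 1024, 10)
for _ in range(100):
    params = train_step(params, feats, labels)
```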