Swin Transformer V2: Scaling Up Capacity and Resolution

By Ze Liu et al.

Table of Contents

Abstract
1. Introduction
2. Related Works
3. Swin Transformer V2
3.1. A Brief Review of Swin Transformer
3.2. Scaling Up Model Capacity
3.3. Scaling Up Window Resolution

Summary

Large-scale NLP models have been shown to significantly improve performance on language tasks, with no signs of saturation. This paper explores large-scale models in computer vision, specifically Swin Transformer V2. The authors address three challenges in scaling up vision models: training instability, the resolution gap between pre-training and fine-tuning, and a hunger for labeled data. To tackle these, they propose a residual post-norm method, log-spaced continuous position bias, and self-supervised pre-training. Swin Transformer V2 sets new performance records on various vision tasks while being more efficient than previous large vision models. The paper emphasizes that scaling up vision models helps bridge the gap between vision and language models.