Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
By Ze Liu et al.
Published on Aug. 17, 2021
Table of Contents
1. Introduction
2. Related Work
3. Method
3.1. Overall Architecture
3.2. Shifted Window based Self-Attention
Summary
This paper presents a new vision Transformer, called Swin Transformer, that serves as a general-purpose backbone for computer vision. It addresses the challenges of adapting the Transformer from language to vision by proposing a hierarchical Transformer whose representation is computed with shifted windows. The hierarchical architecture allows modeling at various scales and has linear computational complexity with respect to image size. The shifted window approach improves efficiency by restricting self-attention to non-overlapping local windows, and it preserves modeling power by still allowing connections across windows. The paper also introduces an efficient batch computation approach for self-attention under shifted window partitioning. Swin Transformer demonstrates strong performance on image classification, object detection, and semantic segmentation.
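To make the linear-complexity claim and the shifted-window idea concrete, here is a minimal pure-Python sketch. The cost expressions follow the estimates given in Section 3.2 of the paper for global and window-based self-attention (softmax cost omitted). The function names, the `window_ids` helper, and the toy sizes are assumptions chosen for illustration; they are not the authors' code, and the attention masking used in the real batch computation is left out.

```python
def msa_flops(h, w, c):
    """Approximate cost of global self-attention over an h x w token map
    with channel dimension c: 4*h*w*c^2 + 2*(h*w)^2*c (quadratic in h*w)."""
    n = h * w
    return 4 * n * c ** 2 + 2 * n ** 2 * c


def window_msa_flops(h, w, c, m=7):
    """Approximate cost when attention is limited to non-overlapping
    m x m windows: 4*h*w*c^2 + 2*m^2*h*w*c (linear in h*w for fixed m)."""
    n = h * w
    return 4 * n * c ** 2 + 2 * m ** 2 * n * c


def window_ids(h, w, m, shift=0):
    """Hypothetical helper: label each token with the index of the window
    it lands in after cyclically shifting the map by `shift` tokens toward
    the top-left, then partitioning into regular m x m windows.
    Assumes h and w are divisible by m."""
    cols = w // m
    return [
        [(((i - shift) % h) // m) * cols + ((j - shift) % w) // m
         for j in range(w)]
        for i in range(h)
    ]


if __name__ == "__main__":
    # Quadrupling the token count quadruples the windowed cost exactly,
    # while the global cost grows faster than that.
    print(window_msa_flops(112, 112, 96) / window_msa_flops(56, 56, 96))
    print(msa_flops(112, 112, 96) / msa_flops(56, 56, 96))

    # On an 8 x 8 map with 4 x 4 windows, tokens (3, 3) and (4, 4) sit in
    # different windows in the regular layout but share one after a shift
    # of m // 2 = 2, which is how information crosses window boundaries.
    regular = window_ids(8, 8, 4)
    shifted = window_ids(8, 8, 4, shift=2)
    print(regular[3][3], regular[4][4])
    print(shifted[3][3], shifted[4][4])
```

Note that after the cyclic shift, windows along the border contain tokens that were not adjacent in the original image. The paper's batch computation handles this by masking attention between such tokens, a step this sketch does not implement.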