Transformer in Transformer

By Kai Han et al.
Published on Oct. 26, 2021

Table of Contents

1. Abstract
2. Introduction
3. Approach
4. Complexity Analysis
5. Network Architecture
6. Experiments

Summary

Transformer in Transformer (TNT) is a vision transformer architecture that strengthens feature representation by treating an input image at two levels: local patches are regarded as "visual sentences", and each patch is further divided into smaller sub-patches regarded as "visual words". An inner transformer block extracts the relationships among the visual words within each sentence, and an outer transformer block extracts the relationships among the sentences. On the ImageNet benchmark, the authors report that TNT achieves higher accuracy and better efficiency than state-of-the-art transformer networks.
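
To make the sentence/word division concrete, here is a minimal sketch of the splitting step only, written in plain Python. It does not implement the inner or outer transformer blocks. The function names (`split_into_patches`, `to_sentences_and_words`), the toy 32x32 single-channel image, and the default sizes (16x16 sentences, 4x4 words) are assumptions chosen for illustration, not code from the paper.

```python
def split_into_patches(grid, size):
    """Split a 2D grid (list of rows) into non-overlapping size x size patches.

    Patches are returned in row-major order; each patch is itself a list of rows.
    """
    height, width = len(grid), len(grid[0])
    if height % size or width % size:
        raise ValueError("grid dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            patches.append([row[left:left + size] for row in grid[top:top + size]])
    return patches


def to_sentences_and_words(image, sentence_size=16, word_size=4):
    """Return a list of visual sentences, each a list of visual words.

    Hypothetical helper: sizes are illustrative defaults, not fixed by this sketch.
    """
    sentences = split_into_patches(image, sentence_size)
    return [split_into_patches(sentence, word_size) for sentence in sentences]


# Toy single-channel 32x32 "image" whose pixel value encodes its position.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
nested = to_sentences_and_words(image)

# 32/16 = 2 patches per side -> 4 sentences; 16/4 = 4 per side -> 16 words each.
print(len(nested), len(nested[0]))
```

In a full model, each visual word and each visual sentence would then be flattened and linearly projected to an embedding before entering the inner and outer blocks respectively; that step is omitted here.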