Summary
Transformer in Transformer (TNT) is a vision transformer architecture that strengthens feature representation by modeling an input image at two levels. The image is divided into patches, called visual sentences, and each patch is further divided into smaller sub-patches, called visual words. An inner transformer block extracts relationships among the visual words within each sentence, and an outer transformer block extracts relationships among the sentences. On the ImageNet benchmark, the proposed TNT model is reported to achieve higher accuracy and better efficiency than state-of-the-art transformer networks.
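
The two-level split into visual sentences and visual words can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `split_grid` and `to_sentences_and_words`, the single-channel 8x8 toy image, and the tile sizes (4 for sentences, 2 for words) are assumptions chosen only to keep the example small, and the transformer blocks themselves are left out.

```python
def split_grid(grid, size):
    """Split a 2D list (H x W) into non-overlapping size x size tiles, row-major."""
    height, width = len(grid), len(grid[0])
    if height % size or width % size:
        raise ValueError("grid dimensions must be divisible by the tile size")
    tiles = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            # Take `size` rows starting at `top`, and `size` columns starting at `left`.
            tiles.append([row[left:left + size] for row in grid[top:top + size]])
    return tiles


def to_sentences_and_words(image, sentence_size, word_size):
    """Return a list of visual sentences, each a list of visual-word tiles.

    Hypothetical helper: sizes are illustrative, not taken from the paper.
    """
    sentences = split_grid(image, sentence_size)
    return [split_grid(sentence, word_size) for sentence in sentences]


# Toy single-channel 8x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
nested = to_sentences_and_words(image, sentence_size=4, word_size=2)
print(len(nested), len(nested[0]))  # prints "4 4": 4 sentences, 4 words each
```

In a full model, each word tile would be flattened and embedded before the inner block attends over the words of one sentence, and the outer block would then attend over the sentence-level embeddings.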