Do Vision Transformers See Like Convolutional Neural Networks?

By Maithra Raghu et al.

Table of Contents

1 Introduction
2 Related Work
3 Background and Experimental Setup
4 Representation Structure of ViTs and Convolutional Networks
5 Local and Global Information in Layer Representations
6 Representation Propagation through Skip Connections

Summary

Convolutional neural networks (CNNs) have long been the standard models for visual data, but recent work has shown that Vision Transformers (ViTs) can match or exceed their performance on image classification tasks. This paper investigates how ViTs and CNNs differ in their internal representation structure, in how they incorporate local and global spatial information, and in how dataset scale affects transfer learning. The study finds that ViTs have more uniform representations across layers, incorporate more global information in their early layers, and learn different features than CNNs. Furthermore, ViTs propagate representations differently through their skip connections, leading to distinct lower-layer representations. These findings shed light on the mechanisms behind Vision Transformers' capabilities on image tasks.
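The layer-wise representation comparisons in the paper are based on centered kernel alignment (CKA), which scores the similarity of two sets of activations computed on the same examples. Below is a minimal sketch of the linear-CKA variant in Python; the activation matrices and layer names are hypothetical placeholders, not data from the paper.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two representation matrices.

    X: (n_examples, d1) activations from one layer.
    Y: (n_examples, d2) activations from another layer.
    Returns a scalar in [0, 1]; higher means the two layers represent
    the examples more similarly (up to a linear transform).
    """
    # Center each feature so the implicit Gram matrices are centered.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # HSIC-style formulation for linear kernels:
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical usage: compare activations of two layers on the same batch.
rng = np.random.default_rng(0)
acts_block_3 = rng.normal(size=(512, 768))   # e.g. flattened ViT block 3 outputs
acts_block_10 = rng.normal(size=(512, 768))  # e.g. flattened ViT block 10 outputs
print(linear_cka(acts_block_3, acts_block_10))
```

Computing this score for every pair of layers in a model yields the similarity heatmaps the paper uses to show that ViT representations stay more uniform from early to late layers than those of CNNs.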