Patch n’ Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution

By Mostafa Dehghani et al.

Table of Contents

1 Introduction
2 Method
3 Experiments
3.1 Improved training efficiency and performance
3.2 Benefits of variable resolution
3.3 Benefits of variable token dropping
3.4 Positional embeddings
3.5 Other aspects of NaViT’s performance

Summary

This document summarizes NaViT (Native Resolution ViT), a Vision Transformer trained with Patch n’ Pack: patches from several images are packed into a single sequence, so inputs of arbitrary resolution and aspect ratio can be processed. This challenges the standard practice of resizing images to a fixed resolution before feeding them to a computer vision model. NaViT improves training efficiency and performance, performs well across a range of resolutions, and transfers to tasks such as image and video classification, object detection, and semantic segmentation. The document also highlights the benefits of variable-resolution training, variable token dropping, and factorized positional embeddings, the last of which improve generalization to new resolutions and aspect ratios. NaViT shows promising out-of-distribution generalization when evaluated on ImageNet and on the robustness benchmarks ImageNet-A and ObjectNet. Overall, NaViT departs from the standard input and modelling pipeline, which was designed for CNNs, and offers a flexible and efficient approach for Vision Transformers.
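The packing idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`patch_grid`, `pack_first_fit`, `token_example_ids`, `attention_mask`), the first-fit packing rule, and the assumption that image sides are already multiples of the patch size are all illustrative choices. It shows how images of different shapes yield different token counts, how those tokens can share fixed-length sequences, and how a mask keeps tokens of different images from attending to each other.

```python
from typing import List, Tuple


def patch_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Patch-grid shape of an image kept at its native aspect ratio.

    Assumption for this sketch: both sides are multiples of `patch`.
    """
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch, width // patch


def pack_first_fit(token_counts: List[int], seq_len: int) -> List[List[int]]:
    """Greedy first-fit packing (an illustrative rule, not the paper's code).

    Returns one list of example indices per packed sequence.
    """
    sequences: List[List[int]] = []
    free: List[int] = []  # remaining room in each packed sequence
    for idx, n in enumerate(token_counts):
        if n > seq_len:
            raise ValueError(f"example {idx} has {n} tokens, more than {seq_len}")
        for s, room in enumerate(free):
            if n <= room:  # first sequence with enough space wins
                sequences[s].append(idx)
                free[s] -= n
                break
        else:  # no existing sequence had room: open a new one
            sequences.append([idx])
            free.append(seq_len - n)
    return sequences


def token_example_ids(members: List[int], token_counts: List[int],
                      seq_len: int) -> List[int]:
    """Example id of every token in one packed sequence; -1 marks padding."""
    ids: List[int] = []
    for idx in members:
        ids.extend([idx] * token_counts[idx])
    ids.extend([-1] * (seq_len - len(ids)))
    return ids


def attention_mask(ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True only if tokens i and j come from the same image."""
    return [[a == b and a != -1 for b in ids] for a in ids]


# Three images with different resolutions and aspect ratios, 16x16 patches.
shapes = [(64, 96), (32, 32), (128, 64)]
counts = [h * w for h, w in (patch_grid(H, W, 16) for H, W in shapes)]
packed = pack_first_fit(counts, seq_len=32)
ids0 = token_example_ids(packed[0], counts, seq_len=32)
mask0 = attention_mask(ids0)
```

Here the first two images (24 and 4 tokens) share one sequence of length 32 with 4 padding tokens, while the third image (32 tokens) fills a sequence on its own. In a real model the boolean mask would be passed to self-attention, and pooling would also have to be done per image rather than per sequence.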