Vision Transformer With Deformable Attention

By Z. Xia et al.

Table of Contents

1. Introduction
2. Related Work
3. Deformable Attention Transformer
3.1. Preliminaries
3.2. Deformable Attention
3.3. Computational Complexity

Summary

Transformers have shown superior performance on vision tasks by offering greater representation power than CNN models, but dense global attention brings excessive memory usage and computational cost. To address these challenges, the paper proposes a novel deformable self-attention module: by selecting key and value pairs in a data-dependent way, the module focuses on relevant regions and captures more informative features. Built on this module, the Deformable Attention Transformer (DAT) serves as a backbone for image classification and dense prediction tasks, and extensive experiments demonstrate its superior performance over competitive baselines.

The deformable attention mechanism efficiently models relations among tokens by attending to important regions of the feature maps. Because the shifted keys and values are shared across all queries rather than computed per query, DAT achieves a flexible yet efficient trade-off between deformation capacity and cost. An offset generation network learns the offsets applied to a uniform grid of reference points, and a deformable relative position bias supplies spatial information to the attention computation. The resulting module has manageable computational complexity and adds minimal overhead to existing architectures.
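To make the mechanism concrete, below is a minimal PyTorch sketch of a deformable attention block in the spirit of the summary above: an offset network predicts shifts for a uniform reference grid, features are bilinearly sampled at the deformed points, and the sampled keys and values are shared by every query. The module name, layer sizes, the reference-grid resolution, and the offset-network layout are illustrative assumptions, not the authors' exact configuration, and the deformable relative position bias and per-group offsets of the original design are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableAttention(nn.Module):
    """Simplified deformable attention sketch (hypothetical configuration)."""

    def __init__(self, dim=96, num_heads=3, n_ref=7, offset_range=2.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.n_ref = n_ref                  # reference points per spatial axis
        self.offset_range = offset_range    # limits how far points may shift

        self.proj_q = nn.Conv2d(dim, dim, 1)
        self.proj_k = nn.Conv2d(dim, dim, 1)
        self.proj_v = nn.Conv2d(dim, dim, 1)
        self.proj_out = nn.Conv2d(dim, dim, 1)

        # Lightweight offset network: depthwise conv + GELU + 1x1 conv -> 2D offsets
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.proj_q(x)

        # Uniform grid of reference points in normalized [-1, 1] coordinates.
        ref_y, ref_x = torch.meshgrid(
            torch.linspace(-1, 1, self.n_ref, device=x.device),
            torch.linspace(-1, 1, self.n_ref, device=x.device),
            indexing="ij",
        )
        ref = torch.stack((ref_x, ref_y), dim=-1)              # (n_ref, n_ref, 2)
        ref = ref.unsqueeze(0).expand(B, -1, -1, -1)           # (B, n_ref, n_ref, 2)

        # Offsets are predicted from the queries, pooled to the reference grid.
        q_pooled = F.adaptive_avg_pool2d(q, self.n_ref)
        offsets = self.offset_net(q_pooled)                    # (B, 2, n_ref, n_ref)
        offsets = offsets.permute(0, 2, 3, 1).tanh() * self.offset_range / max(H, W)

        # Sample features at the deformed points; these become shared keys/values.
        pos = (ref + offsets).clamp(-1, 1)
        sampled = F.grid_sample(x, pos, mode="bilinear", align_corners=True)

        k = self.proj_k(sampled)                               # (B, C, n_ref, n_ref)
        v = self.proj_v(sampled)

        # Multi-head attention: every query attends to the same sampled set.
        q = q.reshape(B, self.num_heads, self.head_dim, H * W)
        k = k.reshape(B, self.num_heads, self.head_dim, -1)
        v = v.reshape(B, self.num_heads, self.head_dim, -1)

        attn = torch.einsum("bhcq,bhck->bhqk", q, k) * self.scale
        attn = attn.softmax(dim=-1)
        out = torch.einsum("bhqk,bhck->bhcq", attn, v)
        out = out.reshape(B, C, H, W)
        return self.proj_out(out)


if __name__ == "__main__":
    x = torch.randn(2, 96, 14, 14)
    y = DeformableAttention()(x)
    print(y.shape)  # torch.Size([2, 96, 14, 14])
```

Note how the cost reflects the complexity claim in the summary: the attention matrix is H*W queries against only n_ref*n_ref sampled keys, so the overhead of the offset network and sampling stays small relative to dense global attention.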