Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

By S. Luo et al.
Published on Nov. 3, 2021

Table of Contents

Abstract
1 Introduction
2 Preliminary
2.1 Attention Module and its Kernel View
2.2 Relative Positional Encoding
2.3 Incorporating RPE into Kernelized Attention
3 Attention with RPE Goes Beyond the Dot-then-exponentiate Function Class
3.1 Attention with RPE Goes Beyond the Dot-then-exponentiate Function Class
3.2 Fast Attention Calculation using Fast Fourier Transform
3.3 RPE Enables Stable Training of Kernelized Attention
4 Conclusion
Appendix A. Proof of Proposition 1
Appendix B. Proofs of Section 3

Summary

The paper studies Transformer attention mechanisms, focusing on kernelized attention combined with relative positional encoding (RPE). It addresses the limitations of standard attention in handling long sequences, shows that attention with RPE goes beyond the dot-then-exponentiate function class that kernelized approximations target, and proposes a method to accelerate the attention computation with RPE using the Fast Fourier Transform (FFT). The paper also examines the stability of training kernelized attention models with RPE, highlighting the roles of the feature map dimension and the query/key norms in obtaining accurate approximations. Theoretical and experimental analyses support the effectiveness of the proposed approach in improving both the performance and the efficiency of Transformer models with RPE.
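To make the FFT-based acceleration concrete, the sketch below illustrates the underlying linear-algebra identity: a relative-positional bias matrix B with entries B[i, j] = b_{i-j} is Toeplitz, so B @ V can be computed in O(n log n) per column by embedding B in a circulant matrix and using the FFT, without ever materializing the n x n matrix. This is a minimal NumPy sketch, not the authors' implementation; the function name toeplitz_matmul_fft and the variable layout are illustrative assumptions.

import numpy as np

def toeplitz_matmul_fft(b, V):
    # b: length 2n-1 vector of relative biases ordered b[-(n-1)], ..., b[0], ..., b[n-1],
    #    so the bias for offset (i - j) sits at index (i - j) + (n - 1).
    # V: (n, d) matrix, e.g. value vectors.
    # Returns B @ V where B[i, j] = b_{i-j}, without forming B explicitly.
    n, d = V.shape
    # Embed the Toeplitz matrix B in a 2n x 2n circulant matrix whose first
    # column is [b_0, b_1, ..., b_{n-1}, 0, b_{-(n-1)}, ..., b_{-1}].
    c = np.concatenate([b[n - 1:], [0.0], b[:n - 1]])
    # A circulant matrix is diagonalized by the DFT, so the product reduces to
    # pointwise multiplication in the frequency domain: O(n log n) per column.
    c_fft = np.fft.fft(c)
    V_pad = np.concatenate([V, np.zeros((n, d))], axis=0)
    out = np.fft.ifft(c_fft[:, None] * np.fft.fft(V_pad, axis=0), axis=0)
    return out[:n].real

# Sanity check against the naive O(n^2) construction.
rng = np.random.default_rng(0)
n, d = 8, 4
b = rng.normal(size=2 * n - 1)
V = rng.normal(size=(n, d))
B = np.array([[b[(i - j) + (n - 1)] for j in range(n)] for i in range(n)])
assert np.allclose(B @ V, toeplitz_matmul_fft(b, V))

The sketch only demonstrates why the Toeplitz structure induced by RPE (the bias depends solely on the offset i - j) admits an O(n log n) matrix product; the paper applies this idea inside the kernelized attention computation rather than to a standalone bias matrix.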