Table of Contents
1. Introduction and Related Work
2. FAVOR+ Mechanism & Positive Orthogonal Random Features
3. Theoretical Results
Summary
The document introduces Performers, Transformer architectures that can estimate regular (softmax) full-rank attention with provable accuracy while using only linear, rather than quadratic, space and time complexity. It discusses how traditional Transformers are limited by quadratic scaling in sequence length and presents the Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism for efficient attention. The paper details how softmax and Gaussian kernels are approximated and why Positive Random Features are used for softmax estimation. Theoretical results quantify the estimation accuracy and the variance reduction achieved by Positive Orthogonal Random Features. The document also highlights the practical implications and advantages of using Performers across a variety of tasks.
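To make the kernel-approximation idea concrete, below is a minimal NumPy sketch (not code from the paper) of softmax attention approximated with positive random features. For simplicity it draws i.i.d. Gaussian projections rather than the orthogonal projections used by the paper's FAVOR+ mechanism, and the function names `positive_random_features` and `performer_attention` are illustrative, not from the original source.

```python
import numpy as np

def positive_random_features(x, W):
    """Map inputs x of shape (n, d) to positive random features of shape (n, m).

    Positive feature map phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), with rows of W
    drawn from N(0, I_d), so that E[phi(x)^T phi(y)] = exp(x^T y).
    """
    m = W.shape[0]
    projection = x @ W.T                                  # (n, m)
    norm_sq = np.sum(x ** 2, axis=-1, keepdims=True) / 2  # (n, 1)
    return np.exp(projection - norm_sq) / np.sqrt(m)

def performer_attention(Q, K, V, m=256, seed=0):
    """Approximate softmax attention in time and memory linear in sequence length."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))          # random Gaussian projections (i.i.d., not orthogonal)
    # Scale queries/keys by d^{-1/4} so phi(q)^T phi(k) approximates exp(q^T k / sqrt(d)).
    Qp = positive_random_features(Q / d ** 0.25, W)       # (n, m)
    Kp = positive_random_features(K / d ** 0.25, W)       # (n, m)
    # Reorder the matrix products: compute K'^T V first (m x d_v), avoiding the n x n matrix.
    KV = Kp.T @ V                            # (m, d_v)
    normalizer = Qp @ Kp.sum(axis=0)         # (n,) row-wise normalization terms
    return (Qp @ KV) / normalizer[:, None]

# Sanity check against exact softmax attention on random inputs.
n, d, dv = 64, 16, 32
rng = np.random.default_rng(1)
Q, K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d)), rng.standard_normal((n, dv))
A = np.exp(Q @ K.T / np.sqrt(d))
exact = (A / A.sum(axis=1, keepdims=True)) @ V
approx = performer_attention(Q, K, V, m=4096)
print(np.max(np.abs(exact - approx)))        # error shrinks as the number of features m grows
```

The key design point illustrated here is the reordering of matrix products: once queries and keys are mapped to feature space, attention is computed as Q'(K'^T V) instead of materializing the n x n attention matrix, which is what yields the linear space and time complexity the summary refers to. Using orthogonal rather than i.i.d. projections, as in the paper, further reduces the variance of the estimator.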