The Expressibility of Polynomial Based Attention Scheme

By Zhao Song et al.
Published on Oct. 30, 2023

Table of Contents

1 Introduction
2 Related Work
3 Preliminary
3.1 Probability Tools
3.2 Definitions of Functions
3.3 Definition of Datasets for Binary Classification
4 Property of Dataset
5 Binary Classification
5.1 High-Degree Polynomial Attention
5.2 Low-Degree Polynomial Attention
6 Self-Attention Dataset
6.1 Definition of Dataset
6.2 Definitions of Functions
6.3 Main Result
6.4 Dataset 1 When Applying Fpoly With Large β: A High-Degree Polynomial Case
6.4.1 The Property of Dataset 1 When Applying Functions upoly and fpoly
6.4.2 The Property of Dataset 1 When Applying the Function cpoly
6.4.3 The Property of Dataset 1 When Applying the Function cpoly With Random Signs
6.4.4 The Property of Dataset 1 When Applying the Function Fpoly
6.5 Dataset 0 When Applying Fpoly With Large β: A High-Degree Polynomial Case
6.5.1 The Property of Dataset 0 When Applying Functions upoly and fpoly
6.5.2 The Property of Dataset 0 When Applying the Function cpoly
6.5.3 The Property of Dataset 0 When Applying the Function cpoly With Random Signs
6.5.4 The Property of Dataset 0 When Applying the Function Fpoly
6.6 Dataset 1 When Applying Fpoly With Small β: A Low-Degree Polynomial Case
6.6.1 The Property of Dataset 1 When Applying Functions upoly and fpoly
6.6.2 The Property of Dataset 1 When Applying the Function cpoly
6.6.3 The Property of Dataset 1 When Applying the Function cpoly With Random Signs
6.6.4 The Property of Dataset 1 When Applying the Function Fpoly
6.7 Dataset 0 When Applying Fpoly With Small β: A Low-Degree Polynomial Case
6.7.1 The Property of Dataset 0 When Applying Functions upoly and fpoly
6.7.2 The Property of Dataset 0 When Applying the Function cpoly
6.7.3 The Property of Dataset 0 When Applying the Function cpoly With Random Signs
6.7.4 The Property of Dataset 0 When Applying the Function Fpoly

Summary

Large language models (LLMs) have significantly improved many aspects of daily life. They serve as the foundation for virtual assistants, streamlining information retrieval and task automation, and they have influenced numerous domains, from healthcare to education, enhancing productivity, decision-making, and accessibility. However, the quadratic complexity of attention in transformer architectures poses a challenge when scaling these models to long textual contexts: it makes training very large models on lengthy texts, or running them efficiently at inference time, impractical. This paper provides a theoretical analysis of the expressive capabilities of polynomial attention. It reveals a disparity between the expressive power of high-degree and low-degree polynomial attention, showing that high-degree polynomials are effective at amplifying large values and thereby distinguishing between the two datasets studied. The analysis justifies the use of higher-degree polynomials in attention mechanisms to capture complex linguistic correlations.
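As a rough illustration of the mechanism being analyzed, the sketch below replaces softmax's exponential with an entrywise degree-β power followed by row normalization. This is a minimal sketch under that assumption: the function name `polynomial_attention` and the exact normalization are illustrative and do not reproduce the paper's precise definitions of upoly, fpoly, cpoly, and Fpoly.

```python
import numpy as np

def polynomial_attention(Q, K, V, beta):
    """Single-head attention where softmax's exponential is replaced by an
    entrywise degree-beta power (illustrative sketch only; the paper's exact
    construction via upoly, fpoly, cpoly, and Fpoly may differ)."""
    scores = Q @ K.T                                          # raw similarity scores, shape (n, n)
    weights = scores ** beta                                  # even beta keeps weights non-negative
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-normalize, analogous to softmax
    return weights @ V                                        # weighted combination of the values

# Example: higher degree concentrates the attention weights on the largest scores.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out_low = polynomial_attention(Q, K, V, beta=2)    # low degree: weights remain diffuse
out_high = polynomial_attention(Q, K, V, beta=8)   # high degree: weights concentrate sharply
```

Raising the scores to a higher even degree pushes most of the normalized weight onto the largest entries, which mirrors the paper's point that high-degree polynomial attention amplifies large values strongly enough to separate the two datasets, while low-degree polynomial attention does not.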