Table of Contents
1 Introduction
2 Background
3 FlashAttention: Algorithm, Analysis, and Extensions
Summary
FlashAttention is a fast, memory-efficient exact attention algorithm that addresses the time and memory bottlenecks Transformers face on long sequences. By making attention IO-aware and using tiling, FlashAttention reduces the number of memory reads/writes between GPU high-bandwidth memory (HBM) and on-chip SRAM, computing exact attention with far fewer memory accesses and therefore running faster in wall-clock time while using less memory. An analysis of its IO complexity shows a substantial reduction in HBM accesses compared to standard attention, and an extension to block-sparse attention improves efficiency and scalability further. Empirically, FlashAttention delivers faster model training and higher-quality models, with significant speedups over existing attention implementations.
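To make the tiling idea concrete, below is a minimal NumPy sketch of block-wise attention with an online (streaming) softmax: keys and values are processed one block at a time while running per-row softmax statistics are maintained, so the full N x N score matrix is never materialized. This is an illustrative sketch under assumed shapes and a made-up block size, not the paper's CUDA kernel; the function name tiled_attention and its arguments are hypothetical.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=128):
    """Compute softmax(Q K^T / sqrt(d)) V one key/value block at a time.

    Illustrative sketch of the tiling + online-softmax idea; block_size is an
    assumption, not a tuned value.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)                 # running (unnormalized) output
    row_max = np.full(N, -np.inf)          # running max of scores per query row
    row_sum = np.zeros(N)                  # running softmax denominator per row

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]   # one key block
        Vb = V[start:start + block_size]   # one value block
        S = (Q @ Kb.T) * scale             # scores for this block only

        new_max = np.maximum(row_max, S.max(axis=1))
        # Rescale previously accumulated sums/outputs to the new max, then add this block.
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against standard (materialized) attention on random data.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

The rescaling step is what allows the softmax to be computed incrementally: whenever a new block raises the running row maximum, previously accumulated sums and outputs are multiplied by exp(old_max - new_max) so all contributions stay on a consistent scale, which is why only O(block) memory is needed for the scores at any time.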