FlashAttention

By Tri Dao et al.
Published on June 24, 2022

Table of Contents

1 Introduction
2 Background
3 FlashAttention: Algorithm, Analysis, and Extensions

Summary

FlashAttention is a fast, memory-efficient exact attention algorithm that addresses the quadratic time and memory cost that makes Transformers slow and memory-hungry on long sequences. By making attention IO-aware and using tiling, FlashAttention reduces the number of reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM, which yields significant wall-clock speedups in model training. Because it computes exact attention with fewer memory accesses, it is both memory-efficient and faster in practice. The analysis of its IO complexity shows that it requires substantially fewer HBM accesses than standard attention. FlashAttention also extends to block-sparse attention, which improves efficiency and scalability further. Empirically, it trains models faster than existing attention implementations and, by enabling longer context, produces higher-quality models.
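To make the tiling idea concrete, here is a minimal NumPy sketch of a tiled attention forward pass with an online (streaming) softmax, the core trick that avoids materializing the full N-by-N score matrix. The function name tiled_attention, the block sizes Br and Bc, and the single-head layout are illustrative assumptions for this sketch, not the paper's CUDA kernel.

```python
# Minimal sketch of tiled attention with online softmax (illustrative, not the paper's kernel).
# Q, K, V: single-head inputs of shape (N, d); Br, Bc: query/key block sizes (assumed values).
import numpy as np

def tiled_attention(Q, K, V, Br=64, Bc=64):
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of scores
    l = np.zeros(N)           # running softmax normalizer per row
    for i in range(0, N, Br):                     # loop over query blocks
        Qi = Q[i:i + Br]
        for j in range(0, N, Bc):                 # loop over key/value blocks
            Kj, Vj = K[j:j + Bc], V[j:j + Bc]
            S = Qi @ Kj.T * scale                 # block of attention scores
            m_new = np.maximum(m[i:i + Br], S.max(axis=1))
            P = np.exp(S - m_new[:, None])        # stabilized block exponentials
            alpha = np.exp(m[i:i + Br] - m_new)   # rescale earlier partial results
            l[i:i + Br] = alpha * l[i:i + Br] + P.sum(axis=1)
            O[i:i + Br] = alpha[:, None] * O[i:i + Br] + P @ Vj
            m[i:i + Br] = m_new
    return O / l[:, None]                         # final normalization
```

In the actual GPU implementation, each block of Q, K, and V is loaded into fast on-chip SRAM once, the partial output and softmax statistics stay on chip, and only the finished output block is written back to HBM; that reuse is where the reduction in memory traffic comes from.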