Efficient Streaming Language Models with Attention Sinks
By Guangxuan Xiao et al.
Published on April 7, 2024
Table of Contents
1. Introduction
2. Related Work
3. Streaming LLM
3.1 The Failure of Window Attention and Attention Sinks
Summary
Published as a conference paper at ICLR 2024, this paper tackles two challenges that arise when deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue: caching the Key and Value (KV) states of all previous tokens consumes extensive memory, and popular LLMs cannot generalize to texts longer than their training sequence length. Window attention, which caches only the most recent tokens, looks like a natural fix, but the paper shows it fails as soon as the text length exceeds the cache size and the initial tokens are evicted. The key observation is that those initial tokens act as "attention sinks": models allocate a disproportionate amount of attention to them regardless of their semantic relevance, so preserving their KV states stabilizes the attention computation and recovers the model's performance. Building on this insight, the proposed StreamingLLM framework keeps a small number of attention-sink tokens together with a sliding window of recent tokens, enabling LLMs trained with a finite attention window to handle effectively limitless inputs in a streaming context without fine-tuning.
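As a rough illustration of the core idea (a minimal sketch, not the authors' implementation; the function name, cache layout, and default sizes below are assumptions), the cache-eviction policy amounts to keeping the KV entries of the first few tokens plus a sliding window of the most recent ones:

```python
import torch

def evict_kv_cache(past_key_values, n_sink=4, window=1020):
    """Keep the first `n_sink` tokens (attention sinks) plus the most
    recent `window` tokens; drop everything in between.

    Assumes each layer's cache is a (key, value) pair of tensors with
    shape (batch, heads, seq_len, head_dim); this layout is an assumption.
    """
    trimmed = []
    for k, v in past_key_values:
        seq_len = k.size(2)
        if seq_len <= n_sink + window:
            # Cache still fits; nothing to evict yet.
            trimmed.append((k, v))
            continue
        # Concatenate the sink tokens with the recent window along the
        # sequence dimension, discarding the middle of the cache.
        k = torch.cat([k[:, :, :n_sink], k[:, :, -window:]], dim=2)
        v = torch.cat([v[:, :, :n_sink], v[:, :, -window:]], dim=2)
        trimmed.append((k, v))
    return trimmed

# Example: a dummy cache for a 2-layer model after 32 tokens
cache = [(torch.randn(1, 8, 32, 64), torch.randn(1, 8, 32, 64))
         for _ in range(2)]
cache = evict_kv_cache(cache, n_sink=4, window=20)
print(cache[0][0].shape)  # torch.Size([1, 8, 24, 64])
```

One detail the sketch omits: in the paper, positional information is assigned relative to positions within the rolling cache rather than positions in the original text, which is what lets the model operate far beyond its training length.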