LongNet: Scaling Transformers to 1,000,000,000 Tokens
By J. Ding et al.
Published on July 19, 2023
Table of Contents
Abstract
1 Introduction
2 LongNet
2.1 Preliminary
2.2 Dilated Attention
2.3 Multi-Head Dilated Attention
2.4 Computational Complexity and Token Dependency
3 LongNet as a Distributed Trainer
3.1 Distributed Algorithm
3.2 Scaling up to 1B Tokens
4 Experiments on Language Modeling
4.1 Setup
4.2 Results
4.3 Scaling Curves of Sequence Length
Summary
The document presents LongNet, a Transformer variant that scales sequence length to more than 1 billion tokens. Its key innovation is dilated attention, which reduces the computational complexity of attention and enables distributed training over extremely long sequences. Experiments show that LongNet outperforms dense Transformers on language modeling tasks, and its scaling curves indicate that performance improves as the training context length grows.
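Since the summary only names dilated attention without showing it, the following is a minimal sketch of the core idea for a single (segment length w, dilation rate r) pair, assuming single-head PyTorch tensors of shape (batch, seq_len, dim) and a sequence length divisible by w. The function name dilated_attention and the toy parameters are illustrative assumptions, not the authors' reference implementation, which mixes several (w, r) pairs and offsets the selection across heads.

```python
# A hedged sketch of dilated attention for one (w, r) pair; not the paper's code.
import torch

def dilated_attention(q, k, v, w: int, r: int):
    b, n, d = q.shape
    assert n % w == 0 and w % r == 0
    # Split the sequence into segments of length w, then keep every r-th token
    # in each segment, shrinking attention cost relative to full O(n^2) attention.
    def sparsify(x):
        x = x.view(b, n // w, w, d)   # (batch, segments, w, dim)
        return x[:, :, ::r, :]        # (batch, segments, w // r, dim)

    qs, ks, vs = sparsify(q), sparsify(k), sparsify(v)
    # Attention is computed only within each sparsified segment.
    scores = torch.einsum("bsid,bsjd->bsij", qs, ks) / d ** 0.5
    out_sparse = torch.einsum("bsij,bsjd->bsid", scores.softmax(dim=-1), vs)
    # Scatter the sparse outputs back to their original positions; positions
    # skipped by this (w, r) pair are left as zeros here and, in the full
    # method, would be covered by other (w, r) pairs.
    out = torch.zeros_like(q).view(b, n // w, w, d)
    out[:, :, ::r, :] = out_sparse
    return out.view(b, n, d)

# Toy usage on random tensors.
q = k = v = torch.randn(2, 16, 8)
y = dilated_attention(q, k, v, w=8, r=2)
print(y.shape)  # torch.Size([2, 16, 8])
```

Because each segment attends only to its own subsampled tokens, the per-pair cost grows linearly with sequence length, which is what makes the distributed training over very long sequences described in Section 3 feasible.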