LongNet: Scaling Transformers to 1,000,000,000 Tokens

By J. Ding et al.
Published on July 19, 2023

Table of Contents

Abstract
1 Introduction
2 LongNet
2.1 Preliminary
2.2 Dilated Attention
2.3 Multi-Head Dilated Attention
2.4 Computational Complexity and Token Dependency
3 LongNet as a Distributed Trainer
3.1 Distributed Algorithm
3.2 Scaling up to 1B Tokens
4 Experiments on Language Modeling
4.1 Setup
4.2 Results
4.3 Scaling Curves of Sequence Length

Summary

The document presents LONGNET, a Transformer variant that scales sequence length to more than 1 billion tokens. Its key innovation is dilated attention, which reduces the computational complexity of attention from quadratic to linear in the sequence length and makes distributed training over extremely long sequences practical. Experiments show that LONGNET outperforms dense Transformers on language modeling tasks, and the scaling curves demonstrate that performance improves as the training context length grows.
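As a rough illustration of the dilated-attention idea summarized above, the sketch below splits the sequence into segments of length w, keeps every r-th token within each segment, and runs ordinary attention over the sparsified segments. The function name dilated_attention, its parameters, and the single (w, r) configuration are illustrative assumptions for a single head; the paper mixes several (w, r) pairs and multiple heads so that every position is eventually covered, which this sketch does not do.

```python
# Minimal single-head dilated-attention sketch (assumed names, not the authors' code).
import torch

def dilated_attention(q, k, v, segment_length, dilation_rate):
    """Dilated attention over (batch, seq_len, dim) tensors.

    Assumes seq_len is divisible by segment_length and
    segment_length is divisible by dilation_rate.
    """
    b, n, d = q.shape
    w, r = segment_length, dilation_rate

    # 1. Split the sequence into segments of length w: (b, n // w, w, d),
    #    then keep every r-th token inside each segment.
    def sparsify(x):
        return x.view(b, n // w, w, d)[:, :, ::r, :]

    qs, ks, vs = sparsify(q), sparsify(k), sparsify(v)

    # 2. Dense scaled dot-product attention within each sparsified segment.
    scores = torch.einsum("bsqd,bskd->bsqk", qs, ks) / d ** 0.5
    out_sparse = torch.einsum("bsqk,bskd->bsqd", scores.softmax(dim=-1), vs)

    # 3. Scatter the outputs back to their original positions; positions
    #    dropped by the dilation stay zero in this simplified sketch.
    out = torch.zeros_like(q).view(b, n // w, w, d)
    out[:, :, ::r, :] = out_sparse
    return out.view(b, n, d)

# Usage: one sequence of 16 tokens, segment length 8, dilation rate 2.
q = k = v = torch.randn(1, 16, 32)
y = dilated_attention(q, k, v, segment_length=8, dilation_rate=2)
print(y.shape)  # torch.Size([1, 16, 32])
```

Because each segment attends only over w / r tokens instead of the full sequence, the cost grows linearly with sequence length for fixed (w, r), which is the source of the complexity reduction the summary refers to.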