Retentive Network: A Successor to Transformer for Large Language Models
By Y. Sun et al.
Published on Aug. 9, 2023
Table of Contents
Abstract
1 Introduction
2 Retentive Networks
2.1 Retention
2.2 Gated Multi-Scale Retention
2.3 Overall Architecture of Retention Networks
2.4 Relation to and Differences from Previous Methods
3 Experiments
3.1 Setup
3.2 Comparisons with Transformer
3.3 Training Cost
Summary
Retentive Network (RetNet) is proposed as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and strong performance. Its retention mechanism for sequence modeling supports three computation paradigms: parallel, recurrent, and chunkwise recurrent representations. Experimental results show that RetNet achieves favorable scaling behavior, parallel training, low-cost deployment, and efficient inference. RetNet outperforms Transformer once the model size exceeds 2B parameters. Zero-shot and few-shot evaluations on a range of downstream tasks further demonstrate competitive performance. Training cost comparisons against Transformer, with and without FlashAttention, show that RetNet reduces memory consumption and improves training throughput.
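The key property behind these paradigms is that the parallel and recurrent forms of retention compute the same output, which lets RetNet train in parallel like a Transformer yet decode step by step like an RNN. The snippet below is a minimal single-head NumPy sketch of that equivalence under simplifying assumptions: it omits the xPos-style rotation, scaling, and group normalization of the full model, and the function names and toy shapes are illustrative rather than taken from the paper.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T, masked by the causal decay matrix D) @ V."""
    n = Q.shape[0]
    idx = np.arange(n)
    # D[i, j] = gamma^(i - j) for i >= j (causal with exponential decay), else 0
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_t = gamma * S_{t-1} + K_t^T V_t, output_t = Q_t S_t."""
    n, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))   # recurrent state carried across time steps
    out = np.zeros((n, d_v))
    for t in range(n):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

# Toy check that both forms agree on random inputs (shapes are illustrative).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
gamma = 0.9
assert np.allclose(retention_parallel(Q, K, V, gamma),
                   retention_recurrent(Q, K, V, gamma))
```

The chunkwise recurrent paradigm combines the two: tokens within a chunk are processed with the parallel form, while a decayed recurrent state is carried across chunk boundaries, which is what makes long-sequence training efficient.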