Retentive Network: A Successor to Transformer for Large Language Models

By Y. Sun et al.
Published on Aug. 9, 2023

Table of Contents

Abstract
1 Introduction
2 Retention
2.1 Retention
2.2 Gated Multi-Scale Retention
2.3 Overall Architecture of Retention Networks
2.4 Relation to and Differences from Previous Methods
3 Experiments
3.1 Setup
3.2 Comparisons with Transformer
3.3 Training Cost

Summary

Retentive Network (RETNET) is proposed as a foundation architecture for large language models, aiming to achieve training parallelism, low-cost inference, and strong performance simultaneously. Its retention mechanism for sequence modeling supports three computation paradigms: a parallel representation for training, a recurrent representation for low-cost inference, and a chunkwise recurrent representation for efficient long-sequence modeling. Experimental results show that RETNET achieves favorable scaling, parallel training, low-cost deployment, and efficient inference, and it outperforms Transformer once the model size exceeds 2B parameters. Zero-shot and few-shot evaluations on various downstream tasks demonstrate competitive performance, and training-cost comparisons against Transformer, with and without FlashAttention, show that RETNET offers advantages in memory consumption and training throughput.
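To make the equivalence between the parallel and recurrent paradigms concrete, here is a minimal NumPy sketch of a single retention head (real-valued, one decay factor, no xpos rotation, projections omitted). The function names and shapes are illustrative assumptions, not the paper's reference implementation; the point is only that both forms compute the same decayed, causal mixing of values.

```python
# Minimal sketch of retention in its parallel and recurrent forms.
# Assumes Q, K, V are already projected; names and shapes are illustrative.
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T ⊙ D) V, where D[n, m] = gamma^(n-m) for n >= m, else 0."""
    seq_len = Q.shape[0]
    idx = np.arange(seq_len)
    # Causal decay mask: lower-triangular with exponential decay by distance.
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, and O_n = Q_n S_n."""
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))          # recurrent state, O(1) memory per step
    outputs = []
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        outputs.append(Q[t] @ S)
    return np.stack(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d_k, d_v, gamma = 8, 4, 4, 0.9
    Q, K, V = (rng.standard_normal((seq_len, d)) for d in (d_k, d_k, d_v))
    # The two paradigms agree up to floating-point error.
    print(np.allclose(retention_parallel(Q, K, V, gamma),
                      retention_recurrent(Q, K, V, gamma)))  # True
```

The chunkwise recurrent paradigm mentioned in the summary combines the two: tokens within a chunk are processed with the parallel form, while information across chunks is carried by the recurrent state.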