Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
By Zihang Dai et al.
Published on June 10, 2019
Table of Contents
Abstract
1 Introduction
2 Related Work
3 Model
3.1 Vanilla Transformer Language Models
3.2 Segment-Level Recurrence with State Reuse
3.3 Relative Positional Encodings
Summary
Transformers have the potential to learn longer-term dependencies, but in language modeling they are limited by a fixed-length context. Transformer-XL is a novel architecture that enables learning dependencies beyond a fixed length without disrupting temporal coherence. It combines a segment-level recurrence mechanism with a novel relative positional encoding scheme. This approach captures longer-term dependencies, resolves the context fragmentation problem, and achieves better performance than both RNNs and vanilla Transformers. A minimal sketch of the recurrence idea is given below.
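To make the segment-level recurrence concrete, here is a minimal single-head, single-layer NumPy sketch: hidden states from the previous segment are cached and reused as extra context for the keys and values of the current segment, while queries come only from the current segment and no gradient flows into the cache. All names (attend_with_memory, Wq, Wk, Wv, seg_len, mem_len) are illustrative assumptions, not the paper's code, and relative positional encodings are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(h, mem, Wq, Wk, Wv):
    """Single-head causal attention over the current segment plus cached memory.

    h   : (seg_len, d)  hidden states of the current segment
    mem : (mem_len, d)  hidden states cached from the previous segment,
                        treated as constants (no gradient flows through them)
    """
    # Keys and values see the cached memory as extended context; queries do not.
    context = np.concatenate([mem, h], axis=0)        # (mem_len + seg_len, d)
    q = h @ Wq                                        # (seg_len, d)
    k = context @ Wk                                  # (mem_len + seg_len, d)
    v = context @ Wv
    scores = q @ k.T / np.sqrt(h.shape[1])

    # Causal mask: position i attends to all memory and to positions <= i.
    seg_len, mem_len = h.shape[0], mem.shape[0]
    causal = np.tril(np.ones((seg_len, seg_len), dtype=bool))
    full_mask = np.concatenate(
        [np.ones((seg_len, mem_len), dtype=bool), causal], axis=1)
    scores = np.where(full_mask, scores, -1e30)
    return softmax(scores) @ v                        # (seg_len, d)

# Toy usage: process a long sequence segment by segment, reusing cached states
# so that effective context grows beyond a single segment.
d, seg_len, mem_len = 16, 4, 4
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
segments = [rng.normal(size=(seg_len, d)) for _ in range(3)]

memory = np.zeros((mem_len, d))            # empty memory before the first segment
for seg in segments:
    out = attend_with_memory(seg, memory, Wq, Wk, Wv)
    memory = seg[-mem_len:]                # cache this segment's inputs for the next one
```

In the full model the cache for layer n holds the previous segment's hidden states from layer n-1 and is stacked across many layers; this single-layer sketch only shows how reusing cached states extends the attention context across segment boundaries.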