Table of Contents
Abstract
1 Introduction
2 Related Work
3 Sparse is Enough
3.1 Sparse Feedforward Layer
3.2 Sparse QKV Layer
Summary
Large Transformer models yield impressive results but are expensive to train and slow to decode. Scaling Transformers propose sparse layers that scale efficiently while remaining competitive with dense baselines, and the study sparsifies every part of the Transformer with the goal of faster decoding. The Sparse Feedforward Layer introduces dynamic sparsity: for each token, a small controller selects only a subset of the feedforward units to activate, so only the corresponding rows and columns of the weight matrices need to be computed, which cuts decoding time significantly. The Sparse QKV Layer subdivides the layer dimensionality into modules and combines them with a multiplicative dense layer that cheaply mixes information across the modules. The resulting model achieves results comparable to the dense baseline while decoding much faster.
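To make the dynamic-sparsity idea concrete, the sketch below shows a feedforward layer whose hidden dimension is split into blocks, with a small controller choosing one active unit per block for each input; only the selected rows and columns of the two projection matrices are then used. This is a minimal NumPy illustration under assumed shapes and names (d_model, d_ff, block_size, the controller tensor C), not the paper's actual Trax implementation, which also handles training-time selection differently.

```python
# Minimal sketch of a dynamically sparse feedforward layer.
# Shapes and names are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, block_size = 16, 64, 4       # d_ff is split into d_ff // block_size blocks
n_blocks = d_ff // block_size

W1 = rng.standard_normal((d_model, d_ff)) * 0.1            # first FF projection
W2 = rng.standard_normal((d_ff, d_model)) * 0.1            # second FF projection
C = rng.standard_normal((d_model, n_blocks, block_size)) * 0.1  # small controller (assumed form)

def sparse_ff(x):
    """For each block, activate only the unit the controller picks,
    so only those rows/columns of W1 and W2 are actually used."""
    logits = np.einsum("d,dnb->nb", x, C)            # controller scores per block
    chosen = logits.argmax(axis=-1)                  # one active unit index per block
    cols = chosen + np.arange(n_blocks) * block_size # flat indices into the d_ff dimension
    h = np.maximum(x @ W1[:, cols], 0.0)             # ReLU over the selected units only
    return h @ W2[cols, :]                           # project back with the matching rows

y = sparse_ff(rng.standard_normal(d_model))
print(y.shape)  # (16,)
```

With block_size = 4, each token touches only a quarter of the feedforward weights at decoding time, which is the source of the speedup the summary describes; the dense result is recovered conceptually by letting every unit in every block be active.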