Sparse Is Enough In Scaling Transformers

By S. Jaszczur et al.

Table of Contents

Abstract
1 Introduction
2 Related Work
3 Sparse is Enough
3.1 Sparse Feedforward Layer
3.2 Sparse QKV Layer

Summary

Large Transformer models yield impressive results but are expensive to train and decode. Scaling Transformers address this by sparsifying every part of the Transformer model, so that it scales efficiently while remaining competitive with the dense baseline. The Sparse Feedforward Layer introduces dynamic sparsity, activating only a fraction of the feedforward units for each token, which significantly reduces decoding time. The Sparse QKV Layer subdivides the layer dimensionality into modules and uses a multiplicative dense layer so that queries, keys, and values are computed faster. The resulting model matches the baseline's quality while decoding much faster.
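To make the dynamic feedforward sparsity concrete, here is a minimal decode-time sketch in NumPy. It assumes a simple linear controller (`controller_W`) that scores the feedforward units and keeps exactly one unit per block of size `block_size`; the function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def sparse_ffn_decode(x, W1, b1, W2, b2, controller_W, block_size):
    """Sketch of a dynamically sparse feedforward layer for a single token.

    The d_ff units are split into blocks of `block_size`; the controller keeps
    exactly one unit per block, so only d_ff / block_size columns of W1 and
    rows of W2 are multiplied for this token.
    """
    d_ff = W1.shape[1]
    n_blocks = d_ff // block_size

    # Controller scores every feedforward unit (a plain linear map here,
    # assumed for illustration) and picks the best unit within each block.
    scores = x @ controller_W                          # shape (d_ff,)
    winners = scores.reshape(n_blocks, block_size).argmax(axis=-1)
    cols = winners + np.arange(n_blocks) * block_size  # absolute unit indices

    # First projection restricted to the selected columns, then ReLU.
    h = np.maximum(x @ W1[:, cols] + b1[cols], 0.0)    # shape (n_blocks,)
    # Second projection restricted to the matching rows.
    return h @ W2[cols, :] + b2                        # shape (d_model,)


# Toy usage: d_model=8, d_ff=32, blocks of 4 -> only 8 units computed per token.
rng = np.random.default_rng(0)
d_model, d_ff, block = 8, 32, 4
x = rng.normal(size=d_model)
out = sparse_ffn_decode(
    x,
    rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
    rng.normal(size=(d_ff, d_model)), np.zeros(d_model),
    rng.normal(size=(d_model, d_ff)), block,
)
print(out.shape)  # (8,)
```

Per token, the feedforward matrix multiplies shrink from d_model × d_ff to d_model × (d_ff / block_size), which is where the decoding speedup comes from; how the controller is trained to make these discrete choices is described in the original paper.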