Fast Inference from Transformers via Speculative Decoding
By Y. Leviathan et al.
Published on June 10, 2023
Table of Contents
Abstract
1. Introduction
2. Speculative Decoding
3. Analysis
4. Number of Arithmetic Operations
5. Approximation Models
Summary
In this paper, the authors introduce speculative decoding, a method for accelerating inference from large autoregressive models such as Transformers without changing the output distribution. The key idea is to borrow speculative execution: a smaller approximation model cheaply drafts several candidate tokens, and the large target model evaluates those candidates in parallel, accepting as many as possible through a novel sampling scheme that provably preserves the target model's distribution. The method requires no changes to model architectures or training. Key contributions include a generalization of speculative execution to the stochastic setting and the resulting decoding mechanism, speculative decoding.
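The acceptance rule behind that sampling scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: `speculative_sample` is a hypothetical helper name, and `p` and `q` stand for the target and draft models' probability vectors over the vocabulary at one position.

```python
import numpy as np

def speculative_sample(p, q, draft_token, rng):
    """Accept or resample one drafted token so the result is
    distributed exactly according to the target distribution p.

    p, q: target-model and draft-model probability vectors (assumed names).
    draft_token: token index sampled from q by the small draft model.
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    # On rejection, resample from the residual distribution
    # proportional to max(0, p - q); this corrects for the cases
    # where q over-sampled a token relative to p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)
```

Drafting several tokens ahead and applying this rule to each lets the target model verify a whole block in one parallel forward pass while still emitting samples from its own distribution.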