Fast Inference from Transformers via Speculative Decoding

By Y. Leviathan et al.
Published on June 10, 2023

Table of Contents

Abstract
1. Introduction
2. Speculative Decoding
3. Analysis
4. Number of Arithmetic Operations
5. Approximation Models

Summary

In this paper, the authors introduce speculative decoding, a method for accelerating inference from large autoregressive models such as Transformers without changing their output distribution. The key idea is speculative execution: a faster approximation model drafts several tokens ahead, and the large target model evaluates those drafts in parallel, accepting or correcting them with a novel sampling scheme that provably preserves the target model's distribution. The approach requires no changes to model architectures or training. Key contributions include a generalization of speculative execution to the stochastic setting and the resulting decoding mechanism, speculative decoding.
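The acceptance rule at the heart of the method can be illustrated for a single token. Below is a minimal sketch using toy distributions over a three-token vocabulary (the distributions `p` and `q` are illustrative assumptions, not values from the paper): the draft model proposes a token from `q`, which is accepted with probability min(1, p(x)/q(x)); on rejection, a token is resampled from the normalized residual max(0, p − q). This rule guarantees the output token is distributed exactly according to the target distribution `p`.

```python
import numpy as np

def speculative_sample(p, q, rng):
    """One step of the speculative sampling rule.

    p: target model's distribution over the vocabulary
    q: draft (approximation) model's distribution
    Returns a token whose marginal distribution is exactly p.
    """
    x = rng.choice(len(q), p=q)                # draft proposes token x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with prob min(1, p(x)/q(x))
        return x
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))

# Toy distributions (illustrative assumptions):
p = np.array([0.5, 0.3, 0.2])   # target model
q = np.array([0.2, 0.5, 0.3])   # draft model
rng = np.random.default_rng(0)
samples = [speculative_sample(p, q, rng) for _ in range(20000)]
empirical = np.bincount(samples, minlength=3) / len(samples)
# empirical should closely match p, despite proposals coming from q
```

In the full method this rule is applied to each of several drafted tokens in sequence, so one parallel pass of the large model can validate multiple tokens at once; the expected speedup grows with how well `q` approximates `p`.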