Fast Inference from Transformers via Speculative Decoding
By Y. Leviathan et al.
Published on June 10, 2023
Table of Contents
Abstract
1. Introduction
2. Speculative Decoding
3. Analysis
4. Number of Arithmetic Operations
5. Approximation Models
Summary
In this paper, the authors introduce speculative decoding, a method for accelerating inference from large autoregressive models such as Transformers without changing the output distribution. The key idea is to borrow speculative execution: a smaller approximation model cheaply drafts several candidate tokens, and the large target model evaluates those candidates in parallel, accepting as many as possible through a novel sampling scheme that provably preserves the target model's distribution. The method requires no changes to model architectures or training. Key contributions include a generalization of speculative execution to the stochastic setting and the resulting decoding mechanism, speculative decoding.
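The acceptance rule behind that sampling scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: `speculative_sample` is a hypothetical helper name, and `p` and `q` stand for the target and draft models' probability vectors over the vocabulary at one position.

```python
import numpy as np

def speculative_sample(p, q, draft_token, rng):
    """Accept or resample one drafted token so the result is
    distributed exactly according to the target distribution p.

    p, q: target-model and draft-model probability vectors (assumed names).
    draft_token: token index sampled from q by the small draft model.
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    # On rejection, resample from the residual distribution
    # proportional to max(0, p - q); this corrects for the cases
    # where q over-sampled a token relative to p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)
```

Drafting several tokens ahead and applying this rule to each lets the target model verify a whole block in one parallel forward pass while still emitting samples from its own distribution.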