DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

By Reza Yazdani Aminabadi et al.

Table of Contents

Abstract
I. Introduction
II. Background and Related Work
III. Inference-Optimized Transformer Kernels
A. Inference Challenges on Different Batch Sizes
B. Deep-Fusion
C. Custom GeMM Kernel
IV. Many-GPU Dense Transformer Inference System
V. Massive-GPU Sparse Model Inference System
VI. ZeRO-Inference
VII. Extensive Evaluation of DeepSpeed Inference
VIII. Conclusion
References

Summary

The paper examines the challenges of high-performance transformer-model inference across a wide range of scales and introduces DeepSpeed Inference, a comprehensive system solution that addresses latency, throughput, and resource constraints. It consists of two main components: DeepSpeed Transformer, a GPU-only solution for dense and sparse models, and ZeRO-Inference, which targets GPU-memory-constrained systems by offloading model weights to CPU and NVMe memory. Parallelism strategies such as tensor and pipeline parallelism are combined to maximize aggregate memory bandwidth and compute utilization across GPUs, while inference-optimized transformer kernels and deep operator fusion reduce kernel-invocation overhead and improve memory-bandwidth utilization. Extensive evaluations demonstrate that DeepSpeed Inference delivers state-of-the-art latency reduction and throughput improvement for transformer models of varying sizes.
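
To make this concrete, below is a minimal sketch of how a model might be served with DeepSpeed Inference's Python API. The model name ("gpt2"), the single-GPU setting, and the prompt are illustrative assumptions, and argument names such as mp_size and replace_with_kernel_inject follow older DeepSpeed releases (newer versions use a tensor_parallel config), so treat this as a sketch rather than a definitive recipe.

    # Minimal sketch: serving a Hugging Face causal LM with DeepSpeed Inference.
    # Assumes the `deepspeed` and `transformers` packages and a CUDA GPU are
    # available; "gpt2" is only a placeholder model.
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any supported causal LM works in principle
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Wrap the model with DeepSpeed's inference engine:
    #  - mp_size splits each layer across GPUs (tensor parallelism)
    #  - replace_with_kernel_inject swaps in the fused, inference-optimized
    #    transformer kernels (argument names may differ by DeepSpeed version)
    engine = deepspeed.init_inference(
        model,
        mp_size=1,                       # number of GPUs for tensor parallelism
        dtype=torch.float16,             # half precision for inference
        replace_with_kernel_inject=True  # enable optimized transformer kernels
    )

    inputs = tokenizer("DeepSpeed Inference makes large models", return_tensors="pt").to("cuda")
    outputs = engine.module.generate(inputs.input_ids, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With mp_size greater than one, the script is typically launched with the deepspeed launcher (for example, deepspeed --num_gpus 2 script.py) so that each GPU holds a tensor-parallel shard of every transformer layer.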