Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
By Mohammad Shoeybi et al.
Published on March 13, 2020
Table of Contents
1. Introduction
2. Background and Challenges
2.1. Neural Language Model Pretraining
2.2. Transformer Language Models and Multi-Head Attention
2.3. Data and Model Parallelism in Deep Learning
3. Model Parallel Transformers
Summary
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in natural language processing applications. This paper presents a simple, efficient intra-layer model-parallel approach that enables training transformer models with billions of parameters. The approach requires no new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented by inserting a few communication operations into native PyTorch. The paper showcases training of transformer models with up to 8.3 billion parameters on 512 GPUs, sustaining 15.1 PetaFLOPs across the entire application with 76% scaling efficiency relative to a strong single-GPU baseline. The authors also show that careful placement of layer normalization in BERT-like models is critical to achieving increased accuracy as model size grows, and they report state-of-the-art results on the WikiText103, LAMBADA, and RACE datasets. The model-parallel design for transformer networks is detailed, focusing on minimizing communication operations and keeping the workload evenly distributed and compute-bound across GPUs.
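To make the intra-layer model-parallel idea concrete, below is a minimal PyTorch sketch of how a transformer MLP block can be partitioned across GPUs in the spirit of the paper: the first linear layer is split by columns so the GeLU applies independently on each GPU, the second is split by rows, and a single all-reduce combines the partial outputs. The class name, shapes, and the simplified handling of the communication operators are illustrative assumptions, not the authors' actual Megatron-LM code; in particular, gradient handling of the communication ops (the paper's `f`/`g` operators) is omitted here.

```python
# Hypothetical sketch of tensor-parallel MLP partitioning, assuming a
# torch.distributed process group has already been initialized.
import torch
import torch.nn as nn
import torch.distributed as dist

class ParallelMLP(nn.Module):
    """Transformer MLP block split across GPUs (illustrative only).

    The first GEMM's weight is partitioned by columns, so GeLU runs
    independently per GPU; the second GEMM's weight is partitioned by
    rows, and one all-reduce combines the partial results.
    """

    def __init__(self, hidden_size: int, ffn_size: int, world_size: int):
        super().__init__()
        assert ffn_size % world_size == 0
        shard = ffn_size // world_size
        # Column-parallel first GEMM: each rank holds a slice of the weight.
        self.fc1 = nn.Linear(hidden_size, shard)
        # Row-parallel second GEMM: bias omitted so it is not summed twice.
        self.fc2 = nn.Linear(shard, hidden_size, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local computation on this rank's shard; no communication needed
        # until the partial outputs are combined.
        partial = self.fc2(self.act(self.fc1(x)))
        # Single forward all-reduce per MLP block to sum partial outputs.
        # A real implementation wraps this in an autograd-aware operator.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```

The self-attention block is partitioned analogously in the paper, with attention heads distributed across GPUs and the output projection split by rows, so each transformer layer needs only a couple of all-reduces in the forward and backward passes.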