Table of Contents
1 Introduction
2 Computation Efficiency
2.1 Optimization
2.2 Data Selection
3 Memory Efficiency
Summary
This paper provides a systematic overview of the efficient training of Transformers, covering both acceleration arithmetic and hardware. It discusses the challenges of training large Transformer models efficiently, in particular the associated computation and memory costs. Techniques such as sparse training, overparameterization, large batch training, and incremental learning are analyzed. The paper also explores methods for improving data efficiency through token masking and importance sampling. Memory-efficient training frameworks are discussed as well, emphasizing parallelism as the common practice for meeting the memory demands of training large models.
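As a concrete illustration of the data selection methods mentioned above, the sketch below shows one common form of importance sampling: drawing the next batch with probability roughly proportional to each example's most recent loss, so that harder examples are revisited more often. This is a minimal sketch of the general idea, not the survey's own algorithm; the names `importance_sample`, `smoothing`, and `recent_losses` are illustrative assumptions.

import numpy as np

def importance_sample(losses, batch_size, smoothing=0.1, rng=None):
    """Pick a batch of example indices with probability roughly
    proportional to each example's most recent loss, mixed with a
    uniform term for stability (all names here are illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    losses = np.asarray(losses, dtype=np.float64)
    probs = losses / losses.sum()
    # Blend with a uniform distribution so no example is starved.
    probs = (1.0 - smoothing) * probs + smoothing / len(losses)
    return rng.choice(len(losses), size=batch_size, replace=False, p=probs)

# Usage: examples with larger recent losses are drawn more often.
recent_losses = [0.1, 2.3, 0.4, 1.8, 0.05, 0.9]
print(importance_sample(recent_losses, batch_size=3))

In practice such a sampler would be refreshed with per-example losses from recent training steps; the uniform smoothing term is one simple way to keep the sampling distribution from collapsing onto a few outliers.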