Small-Scale Proxies for Large-Scale Transformer Training Instabilities
By Mitchell Wortsman, Peter J. Liu, et al.
Published on Oct. 16, 2023
Table of Contents
1 Introduction
2 Experimental Methodology
2.1 Experimental set-up
2.2 LR vs. loss curves and learning rate sensitivity
2.3 Scaling trends for model characteristics
3 Results
3.1 Reproducing two known instabilities at small scale
3.1.1 Attention logit growth
3.1.2 Output logit divergence
3.2 Measuring the effect of other known interventions
3.2.1 Warm-up
3.2.2 Independent weight decay
4 Conclusion
Summary
This paper examines the training instabilities observed in large Transformer models and shows how they can be reproduced and studied at much smaller scales. Two known sources of instability are studied: the growth of logits in attention layers and the divergence of the output logits from the log probabilities. By measuring how the loss varies with learning rate across model scales, the paper also evaluates the effect of other known interventions, such as warm-up, weight decay, and µParam, on training stability. Additionally, the study shows that instabilities can be anticipated from scaling trends in model characteristics before they emerge, and highlights new scientific opportunities for studying training stability.
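The paper studies qk-layernorm (normalizing queries and keys before the attention dot product) and an auxiliary z-loss on the output logits as mitigations for the two instabilities above. The sketch below is a minimal NumPy illustration of both ideas under stated assumptions, not the paper's implementation: the function names, the unbatched single-head shapes, and the z-loss coefficient of 1e-4 are illustrative choices.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector along the last axis (learned scale/bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def qk_layernorm_attention(q, k, v):
    # Apply LayerNorm to queries and keys before the dot product so the
    # attention logits cannot grow without bound (the qk-layernorm
    # mitigation for the attention-logit-growth instability).
    q, k = layer_norm(q), layer_norm(k)
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # Standard softmax over the key dimension.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def z_loss(final_logits, coeff=1e-4):
    # Auxiliary z-loss: penalize log(Z)^2, where Z is the softmax partition
    # function, encouraging the output logits to stay close to normalized
    # log probabilities (the mitigation for output-logit divergence).
    m = final_logits.max(axis=-1, keepdims=True)
    log_z = np.log(np.exp(final_logits - m).sum(axis=-1)) + m.squeeze(-1)
    return coeff * np.mean(log_z ** 2)

# Toy usage: one sequence of 4 tokens, head dimension 8, vocabulary of 16.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = qk_layernorm_attention(q, k, v)
aux = z_loss(rng.normal(size=(4, 16)))
print(out.shape, aux)
```

In this sketch the z-loss would simply be added to the standard cross-entropy objective during training; the small coefficient keeps it from dominating the loss while still discouraging runaway logit magnitudes.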