Transcending Scaling Laws with 0.1% Extra Compute

By Yi Tay et al.
Published on Nov. 16, 2022

Table of Contents

1. Introduction
2. Related Work
3. U-PaLM
3.1. Training Data
3.2. Prefix Language Model Architecture
3.3. Loss Objectives
3.4. Training
4. Experiments
4.1. Improved Scaling Properties on Few-shot Learning
4.2. BigBench Emergent Suite
5. Conclusion
6. References

Summary

Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training an existing causal language model for a small number of additional steps on UL2's mixture of denoising objectives; applied to PaLM, this yields the model the authors call U-PaLM. UL2R unlocks emergent task performance at smaller model scales, and at the 540B scale U-PaLM matches the performance of the final PaLM 540B model with roughly half the training compute. The results on the BigBench emergent suite tasks further demonstrate the effectiveness of UL2R in improving the quality of large language models. Overall, this work highlights the importance of diverse pretraining objectives in enhancing language model performance.
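To make the core idea concrete, the sketch below shows how a UL2-style mixture-of-denoisers data pipeline might look: each example is randomly assigned a regular (R), sequential (S), or extreme (X) denoising task, corrupted accordingly, and tagged with a mode token. The function names, mode tokens, mixture weights, and the exact corruption rates and span lengths are illustrative assumptions, not the paper's settings.

import random

MODE_TOKENS = {"R": "[R]", "S": "[S]", "X": "[X]"}  # task prompts prepended to the input

def span_corrupt(tokens, corruption_rate, mean_span_len, rng):
    """Mask random spans; return (inputs with sentinels, sentinel-delimited targets)."""
    n = len(tokens)
    num_to_mask = max(1, int(n * corruption_rate))
    masked = set()
    while len(masked) < num_to_mask:
        span_len = max(1, int(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(n)
        masked.update(range(start, min(n, start + span_len)))
    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < n:
        if i in masked:
            # Replace each masked run with one sentinel; targets spell out the run.
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < n and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def ul2_example(tokens, rng):
    """Sample one denoiser (R/S/X) and build an (inputs, targets) pair."""
    mode = rng.choices(["R", "S", "X"], weights=[0.5, 0.25, 0.25])[0]
    if mode == "S":  # sequential / prefix-LM denoising: predict the continuation of a prefix
        split = rng.randrange(1, len(tokens))
        inputs, targets = tokens[:split], tokens[split:]
    elif mode == "R":  # regular span corruption: short spans, low corruption rate
        inputs, targets = span_corrupt(tokens, 0.15, 3, rng)
    else:  # extreme span corruption: long spans and/or high corruption rate
        inputs, targets = span_corrupt(tokens, 0.5, 32, rng)
    return [MODE_TOKENS[mode]] + inputs, targets

if __name__ == "__main__":
    rng = random.Random(0)
    toks = [f"tok{i}" for i in range(64)]
    inp, tgt = ul2_example(toks, rng)
    print(inp[:10], "->", tgt[:10])

Prefixing each example with a mode token lets the model condition on which denoising task it is solving; the resulting (inputs, targets) pairs can then be fed to an otherwise standard continued-training loop on the existing checkpoint, which is what keeps the extra compute so small.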