Table of Contents
1. Introduction
2. Related Work
3. Estimating the optimal parameter/training tokens allocation
3.1. Approach 1: Fix model sizes and vary number of training tokens
3.2. Approach 2: IsoFLOP profiles
3.3. Approach 3: Fitting a parametric loss function
Summary
The document addresses how to choose the model size and the number of training tokens for large language models under a fixed compute budget. Its central finding is that, for compute-optimal training, model size and the number of training tokens should be scaled in roughly equal proportion. The authors present three approaches to this question and support their conclusions with empirical estimates from over 400 trained models. The resulting scaling relationship differs from earlier work, which recommended growing model size faster than the amount of training data. The study also emphasizes the importance of high-quality training data at this scale and the computational challenges of training large language models.
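As a concrete illustration of the allocation question, the sketch below assumes the parametric loss form used by the paper's third approach, L(N, D) = E + A/N^alpha + B/D^beta, together with the common approximation that training cost is C ≈ 6ND FLOPs. The constants are approximate fitted values reported in the paper; the function names, grid ranges, and compute budgets are illustrative choices, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): split a FLOP budget C
# between parameters N and training tokens D, assuming the parametric loss
# L(N, D) = E + A / N**alpha + B / D**beta and the approximation C ~= 6 * N * D.

import numpy as np

# Approximate fitted constants reported for the parametric loss (Approach 3).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params: np.ndarray, n_tokens: np.ndarray) -> np.ndarray:
    """Parametric training loss as a function of model size and token count."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def optimal_allocation(flop_budget: float, grid_size: int = 10_000):
    """Grid-search the model size N that minimizes loss under C = 6 * N * D."""
    n_params = np.logspace(7, 13, grid_size)       # candidate model sizes
    n_tokens = flop_budget / (6.0 * n_params)      # tokens implied by the budget
    losses = loss(n_params, n_tokens)
    best = np.argmin(losses)
    return n_params[best], n_tokens[best], losses[best]


if __name__ == "__main__":
    for c in [1e21, 1e23, 1e25]:                   # hypothetical compute budgets
        n_opt, d_opt, l_opt = optimal_allocation(c)
        print(f"C={c:.0e} FLOPs -> N~{n_opt:.2e} params, "
              f"D~{d_opt:.2e} tokens, loss~{l_opt:.3f}")
```

Minimizing this loss subject to C = 6ND gives N_opt proportional to C^(beta/(alpha+beta)) and D_opt proportional to C^(alpha/(alpha+beta)); with the fitted exponents both are close to C^0.5, which is why the paper concludes that parameters and training tokens should grow at roughly the same rate as the compute budget increases.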