DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

By Sang Michael Xie et al.
Published on Nov. 21, 2023

Table of Contents

Abstract
1 Introduction
2 Domain Reweighting with Minimax Optimization (DoReMi)
3 DoReMi Improves LM Training Efficiency and Performance
Experimental setup
DoReMi improves perplexity and downstream accuracy
Table 1: Domain weights on The Pile
Figure 3: Average one-shot downstream accuracy
Figure 4: Per-domain log-perplexity of 8B models on The Pile

Summary

The document presents DoReMi (Domain Reweighting with Minimax Optimization), an algorithm for choosing the mixture proportions of pretraining data domains to improve language model training efficiency and downstream performance. DoReMi uses distributionally robust optimization over domains to tune the domain weights without any knowledge of downstream tasks, using a small proxy model rather than the full-scale model; the resulting weights are then used to resample the data for training the larger model. In experiments on The Pile and the GLaM dataset, DoReMi improves downstream accuracy, reduces log-perplexity across all domains, and reaches the baseline model's accuracy with fewer training steps, improving overall LM training efficiency.
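As a rough illustration of the reweighting step, the sketch below implements an exponentiated-gradient (multiplicative-weights) update of the kind DoReMi's distributionally robust objective calls for: the per-domain excess loss of the proxy model over a fixed reference model drives a multiplicative update of the domain weights, followed by smoothing toward the uniform distribution. Function and parameter names, the step size, and the smoothing constant here are illustrative and not taken from the paper's released code.

```python
import numpy as np

def doremi_weight_update(domain_weights, proxy_losses, ref_losses,
                         step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient update of the per-domain mixture weights.

    domain_weights: current mixture weights, one per domain (sums to 1)
    proxy_losses:   per-domain loss of the small proxy model at this step
    ref_losses:     per-domain loss of a fixed reference model
    """
    # Excess loss: how much worse the proxy is than the reference on each
    # domain, clipped at zero so already-learned domains are not upweighted.
    excess = np.maximum(proxy_losses - ref_losses, 0.0)

    # Multiplicative-weights ascent on the worst-case objective:
    # domains with larger excess loss gain weight.
    logits = np.log(domain_weights) + step_size * excess
    updated = np.exp(logits - logits.max())
    updated /= updated.sum()

    # Mix with the uniform distribution for stability.
    uniform = np.ones_like(updated) / len(updated)
    return (1.0 - smoothing) * updated + smoothing * uniform


# Example: three domains where the proxy lags the reference on domain 2,
# so its weight increases after the update.
alpha = np.array([0.5, 0.3, 0.2])
alpha = doremi_weight_update(alpha,
                             proxy_losses=np.array([2.1, 2.0, 3.5]),
                             ref_losses=np.array([2.0, 2.1, 2.4]))
print(alpha)
```

In the full algorithm, the proxy model is trained under the current weights while the weights are updated online, and the weights averaged over training steps are what get used to resample the data for the large model.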