Language Models Are Few-Shot Learners

By Tom B. Brown et al.
Published on July 22, 2020

Table of Contents

1 Introduction
2 Approach
2.1 Model and Architectures
2.2 Training Dataset
2.3 Training Process
2.4 Evaluation
3 Results
3.1 Language Modeling, Cloze, and Completion Tasks
3.2 Closed Book Question Answering
3.3 Translation
3.4 Winograd-Style Tasks
3.5 Common Sense Reasoning
3.6 Reading Comprehension
3.7 SuperGLUE
3.8 NLI
3.9 Synthetic and Qualitative Tasks
4 Measuring and Preventing Memorization Of Benchmarks
5 Limitations
6 Broader Impacts
6.1 Misuse of Language Models
6.2 Fairness, Bias, and Representation
6.3 Energy Usage
7 Related Work
8 Conclusion
A Details of Common Crawl Filtering
B Details of Model Training
C Details of Test Set Contamination Studies
D Total Compute Used to Train Language Models
E Human Quality Assessment of Synthetic News Articles
F Additional Samples from GPT-3
G Details of Task Phrasing and Specifications
H Results on All Tasks for All Model Sizes

Summary

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
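To make the few-shot setting concrete, here is a minimal sketch of how a few-shot prompt is assembled: the task is specified purely through text, with a natural-language instruction followed by K solved demonstrations and an unsolved query, and no gradient updates to the model. The English-to-French format mirrors the paper's illustrative examples; build_few_shot_prompt is a hypothetical helper written for this sketch, not code from the paper.

def build_few_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, K solved examples (the "shots"),
    and an unsolved query into a single prompt string."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to complete this line
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    instruction="Translate English to French:",
    examples=[("sea otter", "loutre de mer"),
              ("peppermint", "menthe poivrée")],
    query="cheese",
)
print(prompt)
# The prompt is fed to the frozen language model, which continues the
# text with its answer (here, ideally "fromage").

Because the demonstrations live entirely in the prompt, the same pretrained model can be pointed at a new task simply by swapping the instruction and examples.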