Beyond The Imitation Game: Quantifying And Extrapolating The Capabilities Of Language Models

By Aarohi Srivastava et al.
Published on June 12, 2023

Table of Contents

1 Introduction
1.1 Quantity has a quality all its own
1.2 Limitations of current benchmarks
1.3 Beyond the imitation game
2 What is in BIG-bench?
2.1 The BIG-bench API
2.2 BIG-bench Lite
2.3 Evaluation targets
3 Behavior of language models and human raters on BIG-bench
3.1 Aggregate performance improves with model size but is worse than human performance
3.2 Model predictions grow better calibrated with increased scale
3.3 Model classes behave similarly, with benefits from sparsity
3.4 Linearity and breakthroughness: categorizing scaling behaviors
3.5 Even large language models are brittle
3.6 Social bias in large language models
3.7 Performance on non-English languages
4 Behavior on selected tasks
4.1 Checkmate-in-one task
4.2 Periodic elements task
5 Additional related work
6 Discussion
6.1 Overall Findings
6.2 Conclusion
7 Author contributions
7.1 Core contributors
7.2 Task authors and task-specific citations
A Model details
B Definitions of linearity and breakthroughness
C Additional evaluation results
D Performance by keyword
E Expert evaluation
F Additional Tasks

Summary

Generative language models have as their core capability the production of the most likely continuation of a text sequence. This seemingly simple skill is remarkably general: any task that can be specified and executed via text can be framed as text continuation. This encompasses a wide range of cognitive tasks, including, for example, any task that could be resolved over chat, email, or a web forum. A recent consensus is that as generative language models are made larger and are trained on more data, they improve in predictable ways: their cross-entropy loss on a test set scales as a power law with respect to model size, training data size, and the amount of compute used in training. Motivated by this predictable improvement, researchers have now scaled language models to more than one trillion parameters, and we expect models to grow orders of magnitude larger over the next several years. We also expect continued performance gains from improvements in architecture and training methods.
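
To make the power-law claim concrete, the short Python sketch below evaluates a loss curve of the form L(N) = (N_c / N)^alpha, where N is the number of model parameters. This functional form follows published neural scaling-law work; the specific constants N_C and ALPHA used here are placeholder values chosen purely for illustration and are not numbers reported in this paper.

# Minimal sketch of power-law scaling of test cross-entropy with model size.
# N_C and ALPHA are illustrative placeholders, not values reported in BIG-bench.

N_C = 8.8e13   # hypothetical "critical" parameter count setting the scale of the curve
ALPHA = 0.076  # hypothetical scaling exponent

def loss_from_model_size(n_params: float) -> float:
    """Predicted test cross-entropy for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

if __name__ == "__main__":
    # Each tenfold increase in parameters lowers the predicted loss by the
    # same multiplicative factor, which is the defining property of a power law.
    for n in (1e8, 1e9, 1e10, 1e11, 1e12):
        print(f"{n:.0e} parameters -> predicted loss {loss_from_model_size(n):.3f}")

Analogous power-law fits can be written for training data size and training compute; only the constants differ, which is why scaling each of these quantities yields the predictable improvements described above.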