Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
By Damai Dai et al.
Table of Contents
1. Introduction
2. Background
3. Understanding In-Context Learning (ICL) as Implicit Finetuning
4. Experiments
Summary
Large pretrained language models have shown a surprising in-context learning (ICL) ability: they can perform new tasks from a few demonstration examples without any parameter updates. This paper explains language models as meta-optimizers and ICL as a form of implicit finetuning. The theoretical analysis reveals a dual form between Transformer attention and gradient descent: attending to demonstration tokens is equivalent to applying an implicit gradient-descent-like update to the attention weights. Empirical evidence supports this view: ICL covers most of the correct predictions made by explicit finetuning and tends to change attention outputs in the same direction as finetuning does. Overall, ICL is shown to be an effective approach for adapting language models to downstream tasks without parameter updates.
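To make the dual form concrete, here is a minimal numerical sketch, not code from the paper: the matrix names (`W_v`, `W_k`), dimensions, and random inputs are illustrative. Under the linear-attention approximation used in the paper's analysis (softmax and scaling dropped), attending to demonstration tokens gives exactly the same output as first accumulating an implicit weight update from those demonstrations and then applying it to the query, and that update has the same outer-product structure as a gradient-descent update.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8        # hidden size (illustrative)
n_demo = 4   # number of in-context demonstration tokens

# Illustrative projections and inputs (stand-ins for the value/key
# projections and demonstration representations in the paper).
W_v = rng.normal(size=(d, d))          # value projection
W_k = rng.normal(size=(d, d))          # key projection
X_demo = rng.normal(size=(d, n_demo))  # demonstration token vectors
q = rng.normal(size=(d,))              # query vector for the test token

# View 1: linear attention over the demonstrations
# (softmax and scaling omitted, as in the linear approximation).
attn_out = (W_v @ X_demo) @ (W_k @ X_demo).T @ q

# View 2: the dual form. Accumulate an implicit weight update
#   Delta_W = sum_i (W_v x_i)(W_k x_i)^T
# and apply it to the query. Each term is an outer product, matching
# the structure of a gradient update Delta_W = sum_i e_i x_i^T.
Delta_W = sum(np.outer(W_v @ X_demo[:, i], W_k @ X_demo[:, i])
              for i in range(n_demo))
dual_out = Delta_W @ q

# The two views coincide numerically.
assert np.allclose(attn_out, dual_out)
print("max |attn_out - dual_out| =", np.abs(attn_out - dual_out).max())
```

In the paper's formulation, this implicit update plays the role of a meta-gradient produced by the demonstrations (their Delta W for ICL), while the attention to the query's own context contributes the analogue of the initial, zero-shot weights.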