Textbooks Are All You Need

By Suriya Gunasekar et al.

Table of Contents

1 Introduction
2 Training details and the importance of high-quality data
2.1 Filtering of existing code datasets using a transformer-based classifier
2.2 Creation of synthetic textbook-quality datasets

Summary

We introduce phi-1, a new large language model for code that is significantly smaller than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of 'textbook quality' data from the web (6B tokens) and synthetically generated textbooks and exercises produced with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy of 50.6% on HumanEval and 55.5% on MBPP.

In this work, following in the footsteps of Eldan and Li, we explore the improvement that can be obtained along a different axis: the quality of the data. We focus our attention on LLMs trained for code, and specifically on writing simple Python functions from their docstrings. Our model relies on textbook-quality training data, and we show that by intentionally selecting and generating high-quality data, we can achieve state-of-the-art results on code-generation tasks with a much smaller model and less compute than existing approaches.

The creation of synthetic textbook-quality datasets and of the CodeExercises dataset are key components of our data selection process. The synthetic textbook dataset consists of GPT-3.5-generated Python textbooks, while CodeExercises is a small synthetic dataset of function-completion exercises based on natural language instructions.
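For readers unfamiliar with the reported metric, the Python sketch below illustrates two things: the kind of docstring-to-function prompt that phi-1 is evaluated on, and the standard unbiased pass@k estimator introduced with the HumanEval benchmark (with a single greedy sample per problem, pass@1 is simply the fraction of problems whose completion passes the hidden unit tests). The example prompt and function name are hypothetical, and the snippet is illustrative context rather than code from the paper.

import numpy as np

# Illustrative HumanEval-style prompt: the model is given a signature and a
# docstring and must generate the function body. This example problem is
# hypothetical, not taken from the actual benchmark.
PROMPT = '''
def sum_of_squares(nums: list[int]) -> int:
    """Return the sum of the squares of the integers in nums."""
'''

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: completions sampled per problem, c: completions that pass the hidden
    unit tests, k: evaluation budget.
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples on one problem, 101 passing -> pass@1 estimate of 0.505.
print(pass_at_k(n=200, c=101, k=1))

With a single completion per problem, as is typical when reporting pass@1, the estimator reduces to the plain fraction of benchmark problems solved.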