Table of Contents
1 Introduction
2 Training details and the importance of high-quality data
2.1 Filtering of existing code datasets using a transformer-based classifier
2.2 Creation of synthetic textbook-quality datasets
Summary
We introduce phi-1, a new large language model for code that is significantly smaller than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s using a selection of 'textbook quality' data from the web (6B tokens) and synthetically generated textbooks and exercises produced with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains a pass@1 accuracy of 50.6% on HumanEval and 55.5% on MBPP.

In this work, following in the footsteps of Eldan and Li, we explore the improvement that can be obtained along a different axis: the quality of the data. We focus our attention on LLMs trained for code, and specifically on writing simple Python functions from their docstrings. Our model relies on textbook-quality training data, and we show that by deliberately selecting and generating high-quality data, we can achieve state-of-the-art results on code-generation tasks with a much smaller model and far less compute than existing approaches. The creation of the synthetic textbook dataset and the CodeExercises dataset are key components of our data pipeline: the synthetic textbook dataset consists of GPT-3.5-generated Python textbooks, while CodeExercises is a small synthetic dataset of function-completion exercises based on natural language instructions.
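To make the evaluation setting concrete, the minimal sketch below shows a HumanEval-style function-completion problem: the model is given a function signature plus docstring and must produce the body, and a single generated sample counts toward pass@1 only if it passes the problem's unit tests. The prompt, completion, and tests here are illustrative stand-ins written for this summary, not taken verbatim from the benchmark or from phi-1's training data.

# Sketch of a docstring-to-function completion task and a pass@1-style check.
# All names and test cases below are illustrative assumptions.

PROMPT = '''def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than the given threshold, and False otherwise."""
'''

# A completion that a code model might return for the prompt above.
COMPLETION = '''    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
'''

def passes_unit_tests(prompt: str, completion: str) -> bool:
    """Execute prompt + completion and run the problem's checks; under
    pass@1, this single sample either passes or fails."""
    namespace = {}
    exec(prompt + completion, namespace)  # build the candidate function
    f = namespace["has_close_elements"]
    return f([1.0, 2.0, 3.9, 4.0], 0.3) is True and f([1.0, 2.0, 5.0], 0.5) is False

if __name__ == "__main__":
    print("passes unit tests:", passes_unit_tests(PROMPT, COMPLETION))

The CodeExercises problems used for finetuning follow broadly the same format (a natural language instruction or docstring paired with a function to complete), which is what lets a small model specialize on exactly the kind of task that HumanEval and MBPP measure.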