StarCoder: May the Source Be with You

By Raymond Li et al.
Published on Dec. 10, 2023

Table of Contents

Abstract
1 Introduction
2 Related Work
3 Data Curation and Cleaning

Summary

The BigCode community introduces StarCoder and StarCoderBase: 15.5B-parameter code models with an 8K-token context length. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories, and is then fine-tuned on a further 35B Python tokens to create StarCoder. StarCoderBase outperforms every open code LLM that supports multiple programming languages and matches or exceeds OpenAI's code-cushman-001 model, while StarCoder leads among open models fine-tuned on Python. The paper also discusses the legal and privacy concerns surrounding code LLMs, emphasizing the need for transparency, and the models are released under an OpenRAIL-M license agreement designed to enable safe and open model release.
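As a concrete illustration of how one might query the released checkpoints, here is a minimal sketch using the Hugging Face transformers library. It assumes the model is published under the `bigcode/starcoder` Hub ID and that the OpenRAIL-M terms have been accepted on the model page, since the checkpoint is gated behind the license agreement.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub ID for the released checkpoint; access requires
# accepting the OpenRAIL-M license on the model page.
checkpoint = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Note: a 15.5B-parameter model needs substantial memory to load;
# in practice one would pass torch_dtype/device_map options.
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Ask the model to complete a Python function signature.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to StarCoderBase by swapping in its checkpoint ID; generation parameters such as `max_new_tokens` are illustrative choices, not values from the paper.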