Summary
The BigCode community introduces StarCoder and StarCoderBase, two 15.5B-parameter code LLMs with an 8K-token context length. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a curated collection of permissively licensed GitHub repositories; fine-tuning it on a further 35B Python tokens yields StarCoder. In the authors' evaluation, StarCoderBase outperforms every open code LLM supporting multiple programming languages and matches the OpenAI code-cushman-001 model, while StarCoder surpasses every open model fine-tuned on Python. The paper also discusses the legal and privacy concerns raised by code LLMs, including personally identifiable information (PII) in training data, and emphasizes the need for transparency in data curation. The StarCoder models are released under an OpenRAIL-M license, which grants royalty-free access, use, and distribution while attaching use restrictions intended to make the release safer.
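
As a concrete illustration of how the released checkpoints can be used, the sketch below loads StarCoder through the Hugging Face transformers library and generates a code completion. This is a minimal sketch, not code from the paper: the Hub id bigcode/starcoder is assumed to be the published checkpoint, and because the model is gated behind the license agreement, downloading it may require accepting the OpenRAIL-M terms and authenticating with a Hub token.

    # Minimal sketch: completing a Python snippet with StarCoder.
    # Assumes the checkpoint is published as "bigcode/starcoder" on the
    # Hugging Face Hub and that license terms have been accepted there.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "bigcode/starcoder"  # assumed Hub id for the released model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Prompt the model with the start of a function and let it complete it.
    prompt = "def fibonacci(n):"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Greedy decoding is used here for simplicity; sampling parameters such as temperature or top_p can be passed to generate for more varied completions.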