Can Language Models Resolve Real-World GitHub Issues?

By Carlos E. Jimenez et al.
Published on April 5, 2024

Table of Contents

1. Introduction
2. SWE-bench
3. Task Formulation
4. Experimental Setup
5. Results

Summary

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
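To make the task format more concrete, here is a minimal sketch of how a single SWE-bench task instance might be loaded and inspected in Python. The dataset name "princeton-nlp/SWE-bench" and the field names shown are assumptions about the publicly released data, not details stated in this summary.

```python
# Minimal sketch (assumed dataset name and fields, not an official example):
# each SWE-bench instance pairs a GitHub issue with the repository state it
# was filed against and the pull-request patch that resolved it.
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench", split="test")

example = dataset[0]
print(example["repo"])               # source repository, e.g. one of the 12 Python projects
print(example["base_commit"])        # commit the issue was filed against
print(example["problem_statement"])  # the GitHub issue text given to the model
print(example["patch"])              # reference (gold) patch from the merged pull request
print(example["test_patch"])         # tests used to decide whether a proposed fix resolves the issue
```

A model is given the issue text and the repository at the base commit, and its generated patch is judged by whether the associated tests pass after it is applied.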