SWE-smith: A New Toolkit to Massively Scale Software Engineering Datasets for AI

In a bid to overcome the bottleneck of limited training data for AI-driven software engineering (SE) agents, a team of researchers has unveiled SWE-smith, a novel pipeline designed to generate high-quality training datasets at an unprecedented scale. This new toolkit promises to democratize research in automated software engineering by significantly lowering the barrier to entry.

Current datasets for training Language Models (LMs) in SE are notably small, often comprising only a few thousand instances sourced from a handful of GitHub repositories. The creation of these datasets is typically a laborious process, demanding hundreds of hours of human effort and terabytes of storage. SWE-smith tackles this problem head-on by providing an automated solution.

SWE-smith operates by first constructing a dedicated execution environment for any given Python codebase. It then automatically synthesizes hundreds to thousands of task instances, each a deliberate change that breaks existing tests within the codebase. Because every instance is validated against the repository's own test suite, the failing tests serve as a concrete specification of what an agent must repair, keeping the generated data both challenging and grounded in real-world software development.
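
To make that validation step concrete, here is a minimal sketch of how such a loop might look for a pytest-based repository. The helper functions, the data class, and the summary-line parsing are illustrative assumptions, not SWE-smith's actual API.

```python
# Hypothetical sketch: a candidate bug patch only becomes a task instance
# if it breaks tests that previously passed. Helper names are illustrative.
import subprocess
from dataclasses import dataclass


@dataclass
class TaskInstance:
    repo: str
    bug_patch: str              # diff that introduces the bug
    failing_tests: list[str]    # tests an agent must make pass again


def run_failing_tests(repo_dir: str) -> set[str]:
    """Run the repo's pytest suite in its environment and return failing test ids.

    Parsing the short summary lines is deliberately simplified here.
    """
    proc = subprocess.run(
        ["python", "-m", "pytest", "-q", "--tb=no"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return {
        line.split()[1]
        for line in proc.stdout.splitlines()
        if line.startswith("FAILED")
    }


def apply_patch(repo_dir: str, patch: str, reverse: bool = False) -> None:
    """Apply (or undo, with reverse=True) a unified diff using `git apply`."""
    cmd = ["git", "apply"] + (["-R"] if reverse else []) + ["-"]
    subprocess.run(cmd, cwd=repo_dir, input=patch, text=True, check=True)


def maybe_make_instance(repo_dir: str, repo: str, bug_patch: str) -> TaskInstance | None:
    """Keep a candidate bug only if it breaks tests that previously passed."""
    baseline = run_failing_tests(repo_dir)
    apply_patch(repo_dir, bug_patch)
    newly_broken = run_failing_tests(repo_dir) - baseline
    apply_patch(repo_dir, bug_patch, reverse=True)   # restore the original tree
    if newly_broken:
        return TaskInstance(repo=repo, bug_patch=bug_patch,
                            failing_tests=sorted(newly_broken))
    return None
```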

To illustrate the power of SWE-smith, consider a scenario where a software engineering team is developing a function to calculate shipping costs for an e-commerce platform. SWE-smith could automatically generate numerous task instances by introducing subtle errors into the code, such as incorrectly applying discount codes, failing to account for taxes, or mishandling shipping rates for specific regions. These injected errors would then be used to train an AI agent to identify and correct similar bugs in other parts of the codebase.
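
As a purely hypothetical illustration of that scenario, the snippet below pairs a correct shipping-cost function with the kind of subtly wrong rewrite a bug-synthesis step might emit, plus a test that tells them apart; none of these names come from a real repository.

```python
import math


def shipping_cost(weight_kg: float, region: str, discount: float = 0.0) -> float:
    """Correct version: region base rate plus per-kg charge, discount, then 10% tax."""
    rates = {"domestic": 5.0, "international": 15.0}
    base = rates[region] + 2.0 * weight_kg
    discounted = base * (1.0 - discount)
    return round(discounted * 1.10, 2)       # tax applied after the discount


def shipping_cost_buggy(weight_kg: float, region: str, discount: float = 0.0) -> float:
    """Synthesized rewrite: the discount becomes a flat deduction and tax is dropped."""
    rates = {"domestic": 5.0, "international": 15.0}
    base = rates[region] + 2.0 * weight_kg
    return round(base - discount, 2)         # injected bug


def test_discount_and_tax():
    # Passes against shipping_cost but fails against shipping_cost_buggy,
    # so the rewrite would qualify as a usable training instance.
    assert math.isclose(shipping_cost(3.0, "domestic", discount=0.1), 10.89)
```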

Using SWE-smith, the researchers assembled a dataset of over 50,000 instances drawn from 128 GitHub repositories, an order of magnitude larger than any previous dataset in the field. To demonstrate its efficacy, the team trained SWE-agent-LM-32B, a 32-billion-parameter language model, which achieved state-of-the-art performance among open-weight models on SWE-bench Verified, a widely used benchmark for evaluating software engineering agents.

SWE-smith’s key innovation is its ability to automatically synthesize bugs in existing GitHub repositories through several complementary strategies: (1) generating errant rewrites of functions with an LM, (2) procedurally modifying the abstract syntax tree (AST) of functions, (3) undoing pull requests, and (4) combining existing bugs. Together, these methods turn ordinary repositories into training data for software engineering agents; a small sketch of the AST-based approach follows below.
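
Strategy (2) is the easiest to picture with Python's built-in ast module. The sketch below is a minimal, assumption-laden illustration rather than the toolkit's real transformer: it flips strict and non-strict comparison operators to plant an off-by-one bug.

```python
import ast  # ast.unparse requires Python 3.9+


class FlipComparisons(ast.NodeTransformer):
    """Swap strict and non-strict comparisons (e.g. >= becomes >) to plant off-by-one bugs."""

    _swap = {ast.Lt: ast.LtE, ast.LtE: ast.Lt, ast.Gt: ast.GtE, ast.GtE: ast.Gt}

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)
        node.ops = [self._swap.get(type(op), type(op))() for op in node.ops]
        return node


source = """
def is_adult(age):
    return age >= 18
"""

tree = FlipComparisons().visit(ast.parse(source))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # `age >= 18` becomes `age > 18`, so is_adult(18) is now wrong
```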

The researchers are making SWE-smith openly available, offering not only the toolkit itself but also the generated task instances, execution environments, and trained models. This open-source approach aims to fuel further research and development in automated software engineering, enabling the creation of more powerful and accessible AI-driven tools for software developers.