SWE-Gym: A New Benchmark for Training and Evaluating Software Engineering Agents
A team of researchers from UC Berkeley, UIUC, CMU, and Apple have introduced SWE-Gym, the first publicly available environment for training and evaluating real-world software engineering (SWE) agents. Their work, published as a preprint on arXiv, achieves new state-of-the-art results for openly available agents on popular SWE benchmarks.
SWE agents, typically built on language models (LMs), aim to automate software development tasks. Progress has been hampered by a reliance on proprietary models and limited access to high-quality training data. SWE-Gym directly addresses these limitations. It consists of 2,438 real-world Python programming tasks sourced from 11 popular open-source repositories on GitHub. Each task includes:
- A codebase with pre-installed dependencies.
- Executable unit tests.
- A natural language description of the task (drawn from the original GitHub issue report).
This contrasts sharply with existing datasets such as SWE-Bench, whose training data provides only solutions (git patches) without executable environments or reward signals. SWE-Gym offers a more realistic and complete training environment, enabling researchers to train LMs that interact with a full software development workflow. For researchers without access to powerful hardware, the authors included SWE-Gym Lite, a smaller subset of 230 easier tasks for quicker experimentation. They also released SWE-Gym Raw, a raw dataset of 66,894 instances intended for future research into automated dataset synthesis.
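To make the task structure concrete, the sketch below loads a single instance and prints its key fields. It assumes the dataset is published on the Hugging Face Hub under an identifier like "SWE-Gym/SWE-Gym" and follows a SWE-Bench-style schema; the repository id and field names are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch: inspecting one SWE-Gym task instance.
# Assumes a Hugging Face Hub dataset id and SWE-Bench-style field names
# (both are illustrative assumptions, not confirmed by the paper).
from datasets import load_dataset

ds = load_dataset("SWE-Gym/SWE-Gym", split="train")  # hypothetical dataset id

task = ds[0]
print(task["repo"])               # source repository, e.g. "pandas-dev/pandas"
print(task["base_commit"])        # commit at which the environment is built
print(task["problem_statement"])  # natural-language issue description
```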
The researchers trained state-of-the-art open-weight SWE agents using SWE-Gym. They fine-tuned a 32B-parameter Qwen2.5-Coder model using only 491 trajectories (sequences of agent actions and environment responses) sampled from SWE-Gym. This resulted in substantial improvements: +12.3% on SWE-Bench Lite and +13.6% on SWE-Bench Verified (absolute gains).
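As a rough illustration of how sampled trajectories can be turned into fine-tuning data, the sketch below serializes each agent step into a (prompt, target) pair. The trajectory format, with alternating actions and environment observations, is an assumption made for illustration and is not the authors' exact pipeline.

```python
# Minimal sketch: converting a sampled agent trajectory into supervised
# fine-tuning examples. The step structure (action + observation) is an
# assumed format, not the authors' actual data schema.
def trajectory_to_examples(task, trajectory):
    """Yield one (prompt, target) training example per agent action."""
    history = [f"Issue:\n{task['problem_statement']}"]
    examples = []
    for step in trajectory:
        prompt = "\n".join(history)
        examples.append({"prompt": prompt, "target": step["action"]})
        # Append the action and the environment's response to the context
        # so the next example conditions on the full interaction so far.
        history.append(f"Action:\n{step['action']}")
        history.append(f"Observation:\n{step['observation']}")
    return examples
```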
Furthermore, SWE-Gym facilitates inference-time scaling through the use of trained verifiers. These verifiers, also LMs trained on agent trajectories, estimate the probability of success for a given solution. By sampling multiple solutions and selecting the best based on the verifier’s prediction, the researchers achieved even higher performance: 32.0% on SWE-Bench Verified and 26.0% on SWE-Bench Lite – setting a new state-of-the-art for publicly available SWE agents.
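This selection procedure amounts to best-of-n sampling guided by the verifier's score. A minimal sketch follows; `agent.rollout` and `verifier.score` are hypothetical stand-ins for whatever interfaces the released agent and verifier expose, not the authors' actual API.

```python
# Minimal sketch of verifier-guided best-of-n selection. The `agent` and
# `verifier` objects and their methods are hypothetical placeholders.
def best_of_n(agent, verifier, task, n=8):
    candidates = []
    for _ in range(n):
        trajectory, patch = agent.rollout(task)          # sample one solution attempt
        score = verifier.score(task, trajectory, patch)  # estimated probability of success
        candidates.append((score, patch))
    # Keep the candidate the verifier judges most likely to pass the tests.
    best_score, best_patch = max(candidates, key=lambda c: c[0])
    return best_patch, best_score
```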
The study also explored scaling of both training and inference. Training performance consistently improved with more training trajectories, showing no signs of saturation, and inference accuracy improved logarithmically with the number of candidate solutions evaluated by the verifier. Different sampling strategies (random, unique-instance, and repository-based) were also compared; the comparison suggests that SWE-Gym's repository diversity leaves room for further gains rather than acting as a performance bottleneck.
SWE-Gym’s open-source nature and comprehensive dataset significantly advance the field of SWE agent research. The detailed experiments conducted by the researchers provided valuable insights and baselines, highlighting the effectiveness of using real-world data for training and the importance of verifier models for scalable inference. The availability of SWE-Gym, models, and agent trajectories promises to accelerate innovation and further development in automated software engineering.