SWE-rebench: New Dataset and Benchmark Aim to Revolutionize Software Engineering Agent Development

Large Language Models (LLMs) are showing promise at automating software engineering (SWE) tasks, but two major challenges are holding progress back: a lack of high-quality training data and benchmarks that quickly become outdated. To address these issues, researchers at Nebius have introduced SWE-rebench, an automated pipeline and associated dataset designed to continuously extract and evaluate real-world interactive SWE tasks from GitHub.

The core of SWE-rebench is a novel, scalable pipeline that automatically mines GitHub repositories for candidate tasks, configures executable environments for them, and validates solutions. The pipeline uses LLMs to locate installation instructions in repository documentation, extract environment setup recipes from them, and refine those recipes based on error logs when setup fails. Because the process is automated, SWE-rebench can continuously gather fresh, diverse tasks.
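
To make the idea concrete, here is a minimal sketch of an LLM-guided setup loop with error-log feedback, assuming a generic `llm.complete()` text-completion interface. The helper names and retry logic are illustrative and are not taken from the SWE-rebench codebase.

```python
# Illustrative sketch of LLM-guided environment configuration with
# error-log feedback. The `llm` object and its `complete()` method are
# assumed placeholders, not part of any real SWE-rebench API.
import subprocess

MAX_ATTEMPTS = 3

def configure_environment(repo_path: str, readme_text: str, llm):
    """Ask an LLM for an install recipe, run it, and refine it on failure."""
    prompt = f"Extract the shell commands needed to install this project:\n{readme_text}"
    recipe = llm.complete(prompt)  # e.g. "pip install -e .[test]"

    for _ in range(MAX_ATTEMPTS):
        result = subprocess.run(
            recipe, shell=True, cwd=repo_path,
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return recipe  # environment builds; keep this recipe
        # Feed the error log back to the LLM and ask for a corrected recipe.
        prompt = (
            "The previous install command failed.\n"
            f"Command: {recipe}\n"
            f"Error log:\n{result.stderr[-2000:]}\n"
            "Propose a corrected install command."
        )
        recipe = llm.complete(prompt)
    return None  # a repository with no working recipe would be dropped
```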

The resulting SWE-rebench dataset comprises over 21,000 interactive Python-based SWE tasks. Unlike previous datasets, which are either limited to one-shot code generation or manually curated, SWE-rebench focuses on real-world scenarios in which agents must interact with development environments, execute code, and adapt to the outcomes of their actions. This makes it suitable for reinforcement learning of SWE agents at scale.
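
As a rough mental model, each interactive task can be thought of as bundling a repository snapshot, an issue description, a validated environment recipe, and a grading test command. The field names below are hypothetical and do not reflect the published SWE-rebench schema.

```python
# Hypothetical shape of one interactive task; field names are illustrative,
# not the actual SWE-rebench schema.
from dataclasses import dataclass

@dataclass
class SWETask:
    repo: str             # e.g. "https://github.com/org/project"
    base_commit: str      # commit the agent starts from
    issue_text: str       # GitHub issue describing the bug or feature request
    install_recipe: str   # validated commands that build the environment
    test_command: str     # tests whose pass/fail outcome grades the episode
```

Under this framing, the pass/fail outcome of `test_command` is exactly the kind of scalar signal a reinforcement-learning setup needs.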

To build intuition for this, consider a scenario in which an agent needs to fix a bug in a popular open-source library. Instead of simply generating code, the agent must first understand the problem from the GitHub issue description, configure the library’s environment, run the tests to reproduce the bug, modify the code, and then run the tests again to confirm that the fix works and doesn’t introduce regressions. SWE-rebench packages each task with everything this workflow requires: the issue text, a working environment configuration, and the tests used to validate a fix.
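
A sketch of that interaction loop, reusing the hypothetical `SWETask` record above and assuming placeholder `agent` and `sandbox` interfaces; this is not the authors’ agent scaffold.

```python
# Illustrative episode loop for the workflow above. `agent` and `sandbox`
# are assumed interfaces, not real SWE-rebench components.
def run_episode(task, agent, sandbox, max_steps: int = 30) -> bool:
    """Let the agent act in the environment until the tests pass or the step budget runs out."""
    sandbox.setup(task.repo, task.base_commit, task.install_recipe)
    observation = task.issue_text              # the agent starts from the issue description
    for _ in range(max_steps):
        action = agent.act(observation)        # e.g. edit a file or run a shell command
        observation = sandbox.execute(action)  # stdout/stderr is fed back to the agent
        if sandbox.run(task.test_command).passed:
            return True                        # fix validated by the task's own tests
    return False
```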

Furthermore, the researchers created a contamination-free benchmark for agentic software engineering, continuously supplied with fresh tasks via the automated SWE-rebench methodology. By comparing how various LLMs perform on this benchmark versus on SWE-bench Verified, they found evidence that some models’ reported scores may be inflated by data contamination.

For example, an open-source model may perform well on the static SWE-bench dataset because it was exposed to those tasks during training. When evaluated on a fresh set of SWE-rebench tasks, however, its score may drop, suggesting that it memorized benchmark solutions rather than learned to generalize.
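
A toy illustration of that comparison, with made-up resolved rates and an arbitrary gap threshold, just to show how a contamination signal could be flagged:

```python
# Made-up resolved rates; only the comparison logic is the point here.
swe_bench_verified = {"model_a": 0.38, "model_b": 0.31}  # static benchmark
swe_rebench_fresh  = {"model_a": 0.22, "model_b": 0.30}  # freshly mined tasks

for model, old_score in swe_bench_verified.items():
    gap = old_score - swe_rebench_fresh[model]
    verdict = "possible contamination" if gap > 0.05 else "consistent"
    print(f"{model}: gap = {gap:+.2f} ({verdict})")
```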

The SWE-rebench dataset and benchmark are publicly available. They aim to promote transparency, fair comparisons, and accelerated progress in the field of LLM-based software engineering.