
SWE-BENCH-JAVA: A new benchmark for evaluating code-writing AI in Java

Resolving GitHub issues is an increasingly important task for developers. Tools like GitHub Copilot and other AI-powered code generators are making it easier to write code, but how well can these systems resolve real-world issues end to end? To answer this question, researchers developed SWE-BENCH, a benchmark that pairs real GitHub issues from open-source Python repositories with the pull requests that fixed them, and uses these pairs to evaluate the ability of LLMs to resolve real-world software engineering problems. However, Python isn't the only programming language used in software development.

To expand the scope of SWE-BENCH, researchers have now developed SWE-BENCH-JAVA. This new benchmark uses a similar approach to the original SWE-BENCH but focuses on Java code instead. The paper describing SWE-BENCH-JAVA, published recently on arXiv, explains the construction of the benchmark, highlights some challenges faced during its development, and discusses the results of evaluating several LLMs on the benchmark.

SWE-BENCH-JAVA uses a collection of 91 manually verified issues across 6 popular Java code repositories, including fasterxml/jackson-databind, apache/dubbo, and GoogleContainerTools/jib. Repository size, number of files, and issue complexity vary significantly across these projects, giving AI models a diverse and challenging set of issues to address.
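To make the benchmark's structure concrete, here is a minimal sketch of what a single benchmark instance might look like when loaded from a JSONL file. The field names mirror the original SWE-BENCH format (repo, base commit, problem statement, gold patch, test patch); the exact schema, field names, and file layout used by SWE-BENCH-JAVA are assumptions for illustration, not the paper's published format.

```python
# Hedged sketch: loading SWE-BENCH-JAVA-style instances from a JSONL file.
# Field names follow the original SWE-BENCH convention; the actual
# SWE-BENCH-JAVA schema may differ.
import json
from dataclasses import dataclass


@dataclass
class BenchmarkInstance:
    instance_id: str        # e.g. "fasterxml__jackson-databind-1234" (hypothetical ID)
    repo: str               # GitHub repository, e.g. "apache/dubbo"
    base_commit: str        # commit the repository is checked out at for this issue
    problem_statement: str  # the GitHub issue text the model is given
    patch: str              # gold code patch that resolved the issue
    test_patch: str         # tests that must pass after the fix


def load_instances(path: str) -> list[BenchmarkInstance]:
    """Read one JSON object per line and keep only the fields sketched above."""
    instances = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            instances.append(BenchmarkInstance(
                instance_id=record["instance_id"],
                repo=record["repo"],
                base_commit=record["base_commit"],
                problem_statement=record["problem_statement"],
                patch=record["patch"],
                test_patch=record["test_patch"],
            ))
    return instances
```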

The paper also discusses several challenges encountered while building SWE-BENCH-JAVA, such as correcting base-commit crawling errors in the original SWE-BENCH collection scripts, avoiding redundant downloads of repositories and dependencies, and preventing build breaks during incremental compilation.
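The "redundant downloads" problem is easy to picture: every issue in a repository shares the same upstream project, so cloning it anew for each issue wastes time and bandwidth. Below is a minimal sketch of one way a harness could cache clones and reuse them, checking out only the per-issue base commit. This is an illustration of the idea, not the authors' actual harness code; the cache directory name is hypothetical.

```python
# Hedged sketch: clone each repository once into a local cache, then reuse the
# clone and only switch the working tree to the issue's base commit.
import subprocess
from pathlib import Path

CACHE_DIR = Path("repo_cache")  # hypothetical local cache directory


def checkout_instance(repo: str, base_commit: str) -> Path:
    """Clone `repo` once, then check out `base_commit` in the cached copy."""
    local = CACHE_DIR / repo.replace("/", "__")
    if not local.exists():
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["git", "clone", f"https://github.com/{repo}.git", str(local)],
            check=True,
        )
    # Reuse the cached clone; only the checked-out commit changes between issues.
    subprocess.run(["git", "-C", str(local), "checkout", base_commit], check=True)
    return local
```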

To evaluate LLMs on SWE-BENCH-JAVA, the authors used SWE-agent, a framework that adds an agent-computer interface (ACI) to enhance an LLM's ability to resolve software engineering tasks. They tested several models, including GPT-4o, GPT-4o-mini, DeepSeek-Coder-V2, DeepSeek-V2, and Doubao-pro-128k, and compared their performance across the repositories. The results showed that performance varied with the complexity of the issues and the type of repository.
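Conceptually, the evaluation boils down to: have the agent propose a patch for each issue, apply it to the repository at its base commit, and count the issue as resolved only if the associated tests pass. The sketch below shows that loop under stated assumptions: `run_agent` and `run_tests` are hypothetical placeholders (not SWE-agent APIs), and `checkout_instance` refers to the caching helper sketched earlier.

```python
# Hedged sketch of the evaluation loop: agent proposes a patch, the patch is
# applied at the issue's base commit, and the issue counts as resolved only if
# the tests pass afterwards. Placeholders stand in for the real agent and
# Java build/test invocations.
import subprocess
from pathlib import Path


def run_agent(problem_statement: str, repo_path: Path) -> str:
    """Hypothetical call into an LLM agent; returns a unified diff."""
    raise NotImplementedError


def run_tests(repo_path: Path) -> bool:
    """Hypothetical Maven/Gradle test run; True if the relevant tests pass."""
    raise NotImplementedError


def evaluate(instances, checkout_instance) -> float:
    """Return the fraction of issues whose generated patch makes the tests pass."""
    resolved = 0
    for inst in instances:
        repo_path = checkout_instance(inst.repo, inst.base_commit)
        diff = run_agent(inst.problem_statement, repo_path)
        applied = subprocess.run(
            ["git", "-C", str(repo_path), "apply", "-"],
            input=diff.encode("utf-8"),
        ).returncode == 0
        if applied and run_tests(repo_path):
            resolved += 1
    return resolved / len(instances)  # the per-model "resolved rate"
```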

Overall, SWE-BENCH-JAVA is a valuable addition to the existing benchmarks for evaluating LLMs on automated software engineering tasks. The paper concludes with future plans, which include extending the benchmark to other programming languages and improving the quality and coverage of the existing Java and Python datasets. By providing a more comprehensive and challenging environment for evaluating AI code generators, SWE-BENCH-JAVA gives developers a clearer picture of what these tools can actually do.