AI Papers Reader

Personalized digests of latest AI research

New Benchmark "SWE-Perf" Challenges LLMs to Optimize Real-World Code Performance

San Diego, CA - While Large Language Models (LLMs) have shown remarkable progress in tasks like code generation and bug fixing, their ability to optimize code performance in complex, real-world software projects has remained largely untested. To bridge this gap, researchers have introduced SWE-Perf, a novel benchmark designed to systematically evaluate LLMs on code performance optimization within authentic repository contexts.

SWE-Perf comprises 140 carefully curated instances, each derived from performance-enhancing “pull requests” (PRs) submitted to popular GitHub repositories. These PRs represent genuine attempts by developers to speed up their code. Each benchmark instance includes the original codebase, specific “target functions” to focus on, performance-related tests, the expert-authored patch that improved performance, and a complete, executable environment.
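
To make the structure of an instance concrete, here is a minimal sketch of how such a record might be represented in code. The class and field names are illustrative assumptions and do not reflect SWE-Perf's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SWEPerfInstance:
    """Illustrative shape of one SWE-Perf instance (field names are hypothetical)."""
    repo: str                    # e.g. "owner/project" on GitHub
    base_commit: str             # commit hash of the original, pre-optimization codebase
    target_functions: list[str]  # fully qualified names of functions to focus on
    performance_tests: list[str] # tests whose runtime quantifies the speedup
    expert_patch: str            # unified diff of the human-authored optimization
    docker_image: str            # tag of the executable environment for reproducible runs
```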

The Problem: LLMs vs. Real-World Performance Optimization

Optimizing code performance is crucial for production systems, often yielding significant system-wide benefits. However, it’s a task that traditionally demands specialized human expertise and involves intricate, cross-file, and cross-module modifications. Existing benchmarks often focus on isolated function-level optimizations for algorithmic problems, failing to capture the complexity of optimizing entire software repositories. This is akin to asking a model to fix a single sentence in a book versus improving the readability and flow of the entire book.

For example, a performance improvement might involve not just changing a single line of code in one function, but rather restructuring how different parts of the software interact, optimizing data handling across multiple modules, or even tweaking how external libraries are used. This is where LLMs have historically struggled, as it requires a deeper understanding of the entire system’s architecture and interdependencies.
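
As a toy illustration (not drawn from the benchmark itself), the sketch below shows the flavor of such a change: several modules repeatedly re-parse the same configuration file, and introducing a shared cache turns that repeated work into a lookup.

```python
import functools
import json

# Before: every caller in every module re-reads and re-parses the file.
def load_settings(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

# After: a process-wide cache lets all modules share one parsed copy,
# turning repeated disk reads and JSON parsing into a dictionary lookup.
@functools.lru_cache(maxsize=None)
def load_settings_cached(path: str) -> dict:
    with open(path) as f:
        return json.load(f)
```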

SWE-Perf: A Rigorous Evaluation Framework

The SWE-Perf benchmark addresses this challenge by providing a structured pipeline for evaluating LLMs:

  1. Data Collection: The process begins by gathering PRs from popular GitHub repositories. These PRs are then rigorously filtered to identify those that demonstrably improve performance. For instance, a PR might be selected if it significantly reduces the execution time of a critical function.
  2. Performance Measurement: For each selected PR, the original and modified codebases are tested using unit tests. The execution times of these tests are recorded to quantify the performance gains. To ensure reliability, each test is run multiple times, and the environment is standardized using Docker containers.
  3. Verification: Crucially, SWE-Perf ensures that the identified performance improvements are stable and statistically significant. This involves re-running tests multiple times, filtering out outliers, and using statistical methods to confirm that the optimized code is consistently faster (see the sketch after this list).
  4. Target Extraction: Finally, specific “target functions” are identified from the expert-authored patches. These targets are presented to the LLMs in two settings:
    • Oracle (File-Level): Here, the LLM is given the specific function that was optimized and its surrounding code files, mimicking a scenario where a developer knows exactly which part of the code needs improvement.
    • Realistic (Repo-Level): This setting simulates a more natural, end-to-end scenario where the LLM acts as an autonomous agent, navigating the entire repository to find and implement performance optimizations.
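
The measurement and verification steps above can be pictured with a simple timing loop. The sketch below is a rough approximation of the general idea, not the benchmark's actual harness; the pytest command and the stability criterion are placeholders.

```python
import statistics
import subprocess
import time

def time_test_suite(command: list[str], runs: int = 10) -> list[float]:
    """Run a test command several times and record wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, capture_output=True)
        durations.append(time.perf_counter() - start)
    return durations

def is_stable_improvement(before: list[float], after: list[float]) -> bool:
    """Crude stability check: the optimized median must beat the original median
    by more than the combined spread of the two samples."""
    gain = statistics.median(before) - statistics.median(after)
    noise = statistics.stdev(before) + statistics.stdev(after)
    return gain > noise

# Hypothetical usage, timing the same performance tests at the base commit
# and again with the patch applied:
# before = time_test_suite(["pytest", "tests/test_perf.py"])
# after = time_test_suite(["pytest", "tests/test_perf.py"])
# print(is_stable_improvement(before, after))
```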

Key Findings: A Significant Capability Gap

The initial evaluation of leading LLMs on SWE-Perf revealed a substantial performance gap between the models and human experts. Even sophisticated models, when tasked with optimizing code in real-world repositories, fell well short of the performance gains delivered by the human-written patches.

For example, while some models could apply patches and maintain code correctness, their performance improvements were often modest, sometimes less than 1-2% on average, compared to the experts’ 10-30% gains. This suggests that LLMs are still learning to effectively navigate the complexities of repository-wide optimization, which involves understanding broader architectural patterns and interdependencies, rather than just localized code fixes.
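
For context, a relative gain of the kind quoted above is commonly computed from runtimes measured before and after a patch. SWE-Perf's exact metric may differ; this is just one common definition.

```python
def relative_gain(t_original: float, t_optimized: float) -> float:
    """Percentage reduction in runtime, e.g. 12.0 s -> 10.8 s gives a 10% gain."""
    return 100.0 * (t_original - t_optimized) / t_original

print(relative_gain(12.0, 10.8))  # 10.0
```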

However, the research also highlights promising areas. Some agent-based LLM approaches, which can perform iterative reasoning, showed superior performance compared to simpler, direct optimization methods. This indicates that the path forward for LLM-driven code optimization may lie in more sophisticated, agent-like strategies that mimic human problem-solving processes.
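
One way to picture such an agent-style loop is sketched below. The propose_patch, apply_patch, measure_runtime, and tests_pass callables are hypothetical stand-ins, not part of SWE-Perf or any particular agent framework; the point is only the measure-propose-verify cycle.

```python
from typing import Callable

def optimize_iteratively(
    repo: str,
    propose_patch: Callable[[str, float], str],  # hypothetical: LLM proposes a diff
    apply_patch: Callable[[str, str], str],      # hypothetical: returns a patched working copy
    measure_runtime: Callable[[str], float],     # hypothetical: times the performance tests
    tests_pass: Callable[[str], bool],           # hypothetical: correctness check
    budget: int = 5,
) -> tuple[str, float]:
    """Iteratively request patches, keeping one only if tests still pass and it runs faster."""
    best_runtime = measure_runtime(repo)
    for _ in range(budget):
        candidate = apply_patch(repo, propose_patch(repo, best_runtime))
        runtime = measure_runtime(candidate)
        if tests_pass(candidate) and runtime < best_runtime:
            repo, best_runtime = candidate, runtime
    return repo, best_runtime
```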

SWE-Perf provides a critical new tool for researchers and developers looking to push the boundaries of AI in software engineering, specifically in the vital area of code performance. The benchmark aims to accelerate progress in creating AI systems that can deliver meaningful, scalable performance enhancements for production software.