AI Agents Master Software Environments, Cutting Build Times by 43%

A new framework, MEnvAgent, promises to break a key bottleneck in training AI software engineering (SWE) agents by automating the notoriously complex and slow task of constructing verifiable, polyglot development environments. The multi-agent system drastically improves both the success rate and the efficiency of preparing real-world bug-fixing tasks.

Traditionally, setting up a specific environment for a software bug—complete with the correct operating system, dependencies, and toolchain (like Python 3.8, Java Maven, or specific C++ compilers)—is a manual, error-prone, and time-intensive process. This brittleness severely limits the scale and diversity of training data available for modern Large Language Models (LLMs).

MEnvAgent tackles this by leveraging a multi-agent, three-stage workflow: Planning, Execution, and Verification. Crucially, the system introduces a novel Environment Reuse Mechanism that minimizes computational cost.
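To make the workflow concrete, here is a minimal Python sketch of how a planning, execution, and verification loop could be wired together. The `Task`, `plan`, `execute`, and `verify` names are illustrative assumptions for this article, not MEnvAgent's actual interfaces, and the verification check is a stand-in rather than a real build-and-test run.

```python
# Hypothetical sketch of a Planning -> Execution -> Verification pipeline.
# Agent interfaces and task fields are illustrative, not MEnvAgent's API.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Task:
    repo: str
    language: str
    requirements: list[str]        # e.g. ["python3.8", "pytest"]

def plan(task: Task) -> list[str]:
    """Planning agent: turn task requirements into ordered setup steps."""
    return [f"install {req}" for req in task.requirements]

def execute(steps: list[str]) -> dict:
    """Execution agent: apply each step (recorded here, not actually run)."""
    return {"applied_steps": steps, "image": f"env-{hash(tuple(steps)) & 0xffff:04x}"}

def verify(env: dict, task: Task) -> bool:
    """Verification agent: check the environment satisfies the task (stand-in check)."""
    return len(env["applied_steps"]) == len(task.requirements)

def build_environment(task: Task) -> dict | None:
    steps = plan(task)             # Stage 1: Planning
    env = execute(steps)           # Stage 2: Execution
    if verify(env, task):          # Stage 3: Verification
        return env
    return None                    # verification failed: caller may replan or rebuild

if __name__ == "__main__":
    t = Task(repo="example/repo", language="python", requirements=["python3.8", "pytest"])
    print(build_environment(t))
```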

The Power of Incremental Patching

Instead of rebuilding a unique Docker container from scratch for every new bug, MEnvAgent first searches a pool of historical, successfully verified environments for the closest match. If a similar environment is found, an intelligent EnvPatchAgent diagnoses the differences and generates only the minimal incremental command sequence (a “patch”) needed to adapt the old environment to the new task’s requirements.

For example, imagine a Java project that needs an environment with JDK 17 and specific build dependencies. A new bug fix requires only one additional library. Instead of spending hours reinstalling the entire JDK and existing dependencies (the “scratch” approach), MEnvAgent finds the historical JDK 17 environment and generates a patch that simply adds the missing library (e.g., a single dependency-install command). This technique bypasses the heavy costs of full rebuilds.
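The sketch below illustrates the reuse idea under simplified assumptions: an environment is reduced to a base toolchain plus a package set, similarity is measured by package overlap, and a patch is just the list of missing installs. The `EnvSpec` and `plan_build` names and the similarity threshold are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of environment reuse: find the closest verified
# environment in a pool and emit only the incremental commands needed.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class EnvSpec:
    base: str                           # e.g. "jdk-17" or "python-3.8"
    packages: set[str] = field(default_factory=set)

def similarity(a: EnvSpec, b: EnvSpec) -> float:
    if a.base != b.base:
        return 0.0                      # different toolchains: no useful reuse
    union = a.packages | b.packages
    return len(a.packages & b.packages) / len(union) if union else 1.0

def make_patch(old: EnvSpec, new: EnvSpec) -> list[str]:
    """Commands to adapt `old` into `new` (install-only patch for simplicity)."""
    return [f"install {pkg}" for pkg in sorted(new.packages - old.packages)]

def plan_build(target: EnvSpec, pool: list[EnvSpec], threshold: float = 0.5):
    best = max(pool, key=lambda e: similarity(e, target), default=None)
    if best is not None and similarity(best, target) >= threshold:
        return "patch", make_patch(best, target)                         # incremental path
    return "scratch", [f"install {p}" for p in sorted(target.packages)]  # full rebuild

if __name__ == "__main__":
    pool = [EnvSpec("jdk-17", {"maven", "junit", "guava"})]
    target = EnvSpec("jdk-17", {"maven", "junit", "guava", "jackson-databind"})
    print(plan_build(target, pool))     # ('patch', ['install jackson-databind'])
```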

New Benchmark and Performance Gains

To rigorously test this framework, the team created MEnvBench, a comprehensive benchmark comprising 1,000 tasks across 10 mainstream programming languages, including Python, Java, Go, Rust, and C++.

Evaluation results show MEnvAgent significantly outperforms state-of-the-art baselines, occupying the optimal efficiency-quality quadrant in performance charts. Averaged across all models and languages, MEnvAgent improved the strict Fail-to-Pass (F2P) rate—a metric ensuring the environment accurately reproduces the bug (Fail) and verifies the fix (Pass)—by 8.6%. More strikingly, the system reduced the average Time Cost per task by 43.0%.
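As a rough illustration of what the strict F2P criterion checks, the sketch below counts a task as verified only if the test suite fails on the buggy revision and passes on the fixed one. The `TestRunner` callable is an assumption standing in for actually executing the project's tests inside the built environment.

```python
# Hypothetical sketch of a strict Fail-to-Pass check and aggregate F2P rate.
from __future__ import annotations
from typing import Callable

TestRunner = Callable[[str], bool]      # maps a revision id -> did the tests pass?

def fail_to_pass(run_tests: TestRunner, buggy_rev: str, fixed_rev: str) -> bool:
    fails_before = not run_tests(buggy_rev)   # the bug must be reproduced (Fail)
    passes_after = run_tests(fixed_rev)       # the fix must be verified (Pass)
    return fails_before and passes_after

def f2p_rate(results: list[bool]) -> float:
    return sum(results) / len(results) if results else 0.0

if __name__ == "__main__":
    # Toy runner: pretend only the "fixed" revision passes the test suite.
    runner: TestRunner = lambda rev: rev == "fixed"
    outcomes = [fail_to_pass(runner, "buggy", "fixed") for _ in range(3)]
    print(f"F2P rate: {f2p_rate(outcomes):.0%}")   # F2P rate: 100%
```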

The researchers further demonstrated the framework’s scalability by constructing MEnvData-SWE, the largest open-source polyglot dataset of verifiable Docker environments to date, totaling 3,005 instances from 942 real-world GitHub repositories. Fine-tuning existing LLMs on this verified data led to substantial performance gains on downstream SWE tasks, confirming MEnvAgent’s utility as a foundational tool for advancing AI in software engineering.