AI Papers Reader

Personalized digests of latest AI research

AI, Automate Thyself: MLEvolve Outperforms Human-Built Coders in Half the Time

Imagine a software developer tasked with building a complex artificial intelligence system. They write some code, test it, note what failed, borrow ideas from colleagues, and slowly refine the project.

Until now, AI agents trying to automate this “Machine Learning Engineering” (MLE) process have operated more like isolated, forgetful amateurs. They work in silos, forget what they tried ten minutes ago, and rewrite entire codebases from scratch to fix a minor bug.

Now, a new framework called MLEvolve is changing this. Developed by researchers at the Shanghai Artificial Intelligence Laboratory and East China Normal University, MLEvolve is a self-evolving multi-agent system designed to autonomously discover, write, and refine machine learning algorithms over long, complex horizons.

Breaking the Search Silos

To bypass the bottlenecks of older AI coders, MLEvolve introduces three core innovations:

  1. Progressive Monte Carlo Graph Search (MCGS): Traditional systems explore code adjustments in linear, isolated branches. MCGS acts like a collaborative brainstorming room, building “reference edges” that allow promising techniques developed in one branch to be instantly shared with another.
  2. Retrospective Memory: This pairs a textbook-like domain knowledge base for quick starts with a dynamic global diary. The system automatically logs every plan, success, and error without needing extra, expensive computing power to “reflect” on its choices.
  3. Hierarchical Planning: MLEvolve separates the “what to do” from the “how to do it.” A high-level Planner maps out module changes, while a Coder implements them. Crucially, the Coder uses adaptive modes—such as a surgical “Diff” mode for minor patch edits—rather than needlessly rewriting functional code from scratch.
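To make the “reference edge” idea from the first item more concrete, here is a minimal Python sketch. It is not MLEvolve's implementation: the `Node` structure, the `add_reference_edges` helper, the top-k selection rule, and the example scores are all assumptions made for illustration. The `uct` function is the standard upper-confidence formula from Monte Carlo tree search, included only to show where node selection would plug in.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One candidate solution in the search graph (hypothetical structure)."""
    name: str
    branch: str
    score: float = 0.0   # validation metric, higher is better
    visits: int = 0
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    references: List["Node"] = field(default_factory=list)  # cross-branch edges


def add_child(parent: Node, name: str, score: float) -> Node:
    """Expand a node inside its own branch (an ordinary tree edge)."""
    child = Node(name=name, branch=parent.branch, score=score, parent=parent)
    parent.children.append(child)
    return child


def uct(node: Node, total_visits: int, c: float = 1.4) -> float:
    """Standard UCT value; unvisited nodes are explored first."""
    if node.visits == 0:
        return math.inf
    return node.score + c * math.sqrt(math.log(total_visits) / node.visits)


def add_reference_edges(node: Node, all_nodes: List[Node], k: int = 1) -> None:
    """Link `node` to the k best-scoring nodes from other branches.

    These extra edges are what turn the search tree into a graph:
    the node can now draw on ideas it did not inherit from its parent.
    """
    others = [n for n in all_nodes if n.branch != node.branch]
    node.references = sorted(others, key=lambda n: n.score, reverse=True)[:k]


# Illustrative data only; the names and scores are invented.
root_a = Node("baseline_cnn", branch="A", score=0.71, visits=3)
root_b = Node("baseline_gbdt", branch="B", score=0.78, visits=2)
child_a = add_child(root_a, "cnn_with_augmentation", score=0.74)
add_reference_edges(child_a, [root_a, root_b, child_a])
print([n.name for n in child_a.references])  # ['baseline_gbdt']
```

In this sketch a node in branch A ends up holding a pointer to the strongest node in branch B, so a planner expanding it could read that solution alongside its own parent's.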

MLEvolve in Action: Fusing Best Ideas

How does this look in practice? Consider a real-world Kaggle competition task: diagnosing blindness from medical images.

During an actual run, MLEvolve’s agent tried standard image-regularization tricks but hit a performance wall. Rather than repeatedly guessing, the agent used Intra-branch evolution to scan its own historical attempts. It diagnosed the issue as an architectural bottleneck and successfully fused a modern DINOv3 model backbone with a classic ResNet50 neural network.
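The “scan its own historical attempts” step can be sketched as a simple log plus a plateau check. This is a hedged illustration rather than the paper's method: the `Attempt` record, the `has_plateaued` rule, its `window` and `min_gain` parameters, and the scores below are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """One logged step in a branch: what was planned and how it scored."""
    plan: str
    score: float
    error: str = ""


def has_plateaued(history: List[Attempt], window: int = 3,
                  min_gain: float = 0.005) -> bool:
    """True if the last `window` attempts beat the earlier best by less than `min_gain`."""
    if len(history) <= window:
        return False  # not enough history to judge
    earlier_best = max(a.score for a in history[:-window])
    recent_best = max(a.score for a in history[-window:])
    return recent_best - earlier_best < min_gain


def summarize(history: List[Attempt]) -> str:
    """Render the branch history as text a planner could read."""
    return "\n".join(f"{i + 1}. {a.plan} -> {a.score:.3f}"
                     for i, a in enumerate(history))


# Invented example: regularization tweaks stop paying off.
history = [
    Attempt("baseline backbone", 0.800),
    Attempt("add mixup", 0.810),
    Attempt("add cutout", 0.811),
    Attempt("stronger weight decay", 0.809),
    Attempt("label smoothing", 0.812),
]
if has_plateaued(history):
    print("Plateau detected; history for the planner:")
    print(summarize(history))
```

A check like this only flags that small tweaks have stopped helping. Deciding that the bottleneck is architectural, as in the run described above, would still fall to the planner reading the summary.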

In another task involving bird vocalization, the agent got stuck using a standard loss function. Through Cross-branch reference, it peeked at a successful parallel run, identified an alternative “Asymmetric Loss” design, and surgically swapped it in using its Diff editing mode. When overall progress stalls, Multi-branch aggregation can even blend the best elements of multiple separate attempts—such as a specific feature filter from Branch A and a validation strategy from Branch B—to launch a brand-new, hybrid strategy.
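A “Diff” style edit like the loss swap above can be approximated with a search-and-replace patch that touches one line and leaves the rest of the script alone. The sketch below is an assumption about how such a mode could work, not MLEvolve's actual patch format. `apply_patch`, `StandardLoss`, and `AsymmetricLoss` are placeholder names, and the training script is treated as plain text.

```python
def apply_patch(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search`.

    Refusing missing or ambiguous matches keeps the edit surgical:
    the patch either lands where intended or fails loudly.
    """
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(search, replace, 1)


# Hypothetical training script, treated as plain text.
script = """\
model = build_model()
loss_fn = StandardLoss()
train(model, loss_fn)
"""

patched = apply_patch(
    script,
    search="loss_fn = StandardLoss()",
    replace="loss_fn = AsymmetricLoss()",
)
print(patched)
```

Only the loss line changes. The model construction and training call are carried over byte for byte, which is the point of patching instead of regenerating the whole file.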

Unprecedented Efficiency

The results are striking. Tested on OpenAI’s challenging MLE-Bench—a benchmark of 75 Kaggle data science competitions—MLEvolve achieved a state-of-the-art 65.3% average medal rate.

Remarkably, it accomplished this under a strict 12-hour budget, half the standard 24-hour limit used by competing systems, while maintaining a 100% valid code submission rate. It also demonstrated cross-domain versatility, outperforming specialized math agents on complex algorithmic optimization tasks.

By proving that AI can successfully manage its own trial-and-error process, MLEvolve brings us one step closer to truly autonomous, self-improving scientific discovery.