New Tool Creates Vast Reservoir of Realistic Machine Learning Tasks
San Francisco, CA – October 8, 2025 – Researchers have developed an automated system called MLE-Smith that can generate a massive and diverse collection of machine learning engineering (MLE) tasks, overcoming a significant bottleneck in the development of advanced AI agents. Current MLE benchmarks often rely on manually curated tasks, which are time-consuming to create and difficult to scale, and which may not fully reflect real-world complexity.
MLE-Smith addresses these limitations with a novel “generate-verify-execute” pipeline. This system transforms raw datasets into competition-style MLE challenges, ensuring each task is structurally sound, semantically coherent, and practically solvable by AI agents.
For example, imagine a raw dataset containing information about different electric vehicles, including their specifications and performance metrics. MLE-Smith could take this raw data and, through its automated pipeline, generate a task that challenges an AI agent to predict a vehicle’s range based on its features. This task would not only require the AI to process the raw data and build a predictive model but also to adhere to specific evaluation metrics and data formatting requirements, mimicking a real-world machine learning competition.
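To make this concrete, the snippet below sketches what such a generated, competition-style task specification might look like for the electric-vehicle example. It is purely illustrative: the field names, file paths, metric, and column names are assumptions for this sketch, not MLE-Smith's actual output format.

```python
# Hypothetical task specification a pipeline like MLE-Smith might emit
# for the electric-vehicle example. All names and values are illustrative
# assumptions, not the tool's real format.
ev_range_task = {
    "task_id": "ev-range-regression",            # assumed identifier
    "description": (
        "Predict each vehicle's driving range (km) from its specifications "
        "(battery capacity, weight, drivetrain, ...)."
    ),
    "data": {
        "train": "data/train.csv",               # features plus 'range_km' target
        "test": "data/test.csv",                 # features only
        "sample_submission": "data/sample_submission.csv",
    },
    "target_column": "range_km",
    "metric": "rmse",                            # lower is better
    "submission_columns": ["vehicle_id", "range_km"],
}
```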
The system employs a multi-agent approach (see the sketch after this list):
- Brainstormer: This agent explores a given dataset and proposes multiple potential task formulations, considering various learning objectives and modeling strategies. It aims to identify diverse and meaningful challenges that can be derived from the data.
- Designer: Taking a proposed task formulation, this agent instantiates a complete, end-to-end executable MLE task. This involves defining data preprocessing, creating training and testing splits, specifying input/output schemas, and generating evaluation scripts.
- Refactor: This agent standardizes all generated tasks into a unified format, ensuring consistency across the entire benchmark.
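Conceptually, the three agents form a sequential pipeline: proposals flow from the Brainstormer to the Designer and are then normalized by the Refactor agent. The sketch below illustrates that flow under assumed interfaces; the class names, method signatures, and returned fields are hypothetical, not the paper's actual code.

```python
from __future__ import annotations
from dataclasses import dataclass

# Simplified, hypothetical sketch of the three-agent flow described above.
# Class and method names are assumptions for illustration only.

@dataclass
class TaskProposal:
    objective: str          # e.g. "regression on range_km"
    rationale: str

@dataclass
class MLETask:
    spec: dict              # schema, splits, evaluation details, etc.

class Brainstormer:
    def propose(self, dataset_path: str) -> list[TaskProposal]:
        # Explore the dataset and suggest candidate task formulations.
        return [TaskProposal("regression on range_km", "numeric target column present")]

class Designer:
    def instantiate(self, dataset_path: str, proposal: TaskProposal) -> MLETask:
        # Build an end-to-end task: preprocessing, train/test split,
        # input/output schema, and an evaluation procedure.
        return MLETask(spec={"objective": proposal.objective,
                             "splits": ["train", "test"],
                             "metric": "rmse"})

class Refactor:
    def standardize(self, task: MLETask) -> MLETask:
        # Normalize the task into the benchmark's unified format.
        task.spec["format_version"] = "unified-v1"
        return task

def generate_tasks(dataset_path: str) -> list[MLETask]:
    brainstormer, designer, refactor = Brainstormer(), Designer(), Refactor()
    return [refactor.standardize(designer.instantiate(dataset_path, p))
            for p in brainstormer.propose(dataset_path)]
```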
A crucial aspect of MLE-Smith is its robust “hybrid verification mechanism.” This multi-layered approach includes the following checks (an illustrative sketch of the first layer follows the list):
- Assertions: These are deterministic checks that ensure structural integrity, verifying file formats, directory layouts, and adherence to defined schemas.
- Reviews: An LLM-based agent reviews the tasks for semantic soundness, assessing clarity of descriptions, appropriateness of metrics, and whether the task encourages genuine AI agent behavior.
- Execution-based Validation: The system runs each generated task within an interactive MLE environment to empirically validate its solvability and ensure it provides meaningful signals for AI agent performance.
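As a concrete illustration of the deterministic assertions layer, the sketch below shows the kind of structural checks such a stage might run on a generated task directory. The expected files and column names are assumptions for this example, not MLE-Smith's actual rules.

```python
import csv
from pathlib import Path

# Assumed directory layout for a generated task; illustrative only.
REQUIRED_FILES = ["description.md", "data/train.csv",
                  "data/test.csv", "data/sample_submission.csv"]

def check_task_structure(task_dir: str) -> list:
    """Return a list of structural problems; an empty list means the checks pass."""
    root = Path(task_dir)
    problems = []

    # 1. Required files and directory layout exist.
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")

    # 2. The sample submission matches the declared schema (assumed columns).
    sub_path = root / "data" / "sample_submission.csv"
    if sub_path.is_file():
        with sub_path.open(newline="") as f:
            header = next(csv.reader(f), [])
        if header[:1] != ["vehicle_id"]:
            problems.append("sample_submission.csv must start with a 'vehicle_id' column")

    return problems
```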
The researchers applied MLE-Smith to 224 real-world datasets, successfully generating 606 diverse tasks. These tasks span a wide range of data modalities (e.g., tabular, image, audio), learning objectives (e.g., classification, regression), and domains (e.g., healthcare, sports).
Evaluations of cutting-edge Large Language Models (LLMs) on the newly generated tasks showed that their performance correlates strongly with performance on existing, human-designed benchmarks. This indicates that MLE-Smith can effectively scale the creation of MLE tasks while maintaining their quality, realism, and discriminative power, paving the way for more robust evaluation and development of future AI agents.