LLMs Struggle to Replicate Research Results: New Benchmark Unveils Challenges

Research Results Are Difficult to Replicate

A core principle of science is reproducibility. Yet, replicating research findings in machine learning (ML) and natural language processing (NLP) often proves challenging. This is because research code is frequently difficult to run, requiring substantial effort to set up and execute. Researchers often face difficulties like installing the correct environment, resolving outdated dependencies, fixing bugs, and determining the correct execution commands. These steps are especially time-consuming, as support and documentation may be incomplete or even unavailable.
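To make these hurdles concrete, the kind of setup-and-run sequence a researcher (or an agent) has to get right might look like the sketch below. The repository URL, dependency pin, and entry point are purely illustrative assumptions, not taken from any particular benchmark task.

```python
import subprocess

def run(cmd: str) -> None:
    """Run a shell command and fail loudly if it errors."""
    print(f"$ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# 1. Get the code (hypothetical repository).
run("git clone https://github.com/example/research-repo.git repo")

# 2. Install the often under-specified environment.
run("python -m venv repo/.venv")
run("repo/.venv/bin/pip install -r repo/requirements.txt")

# 3. Patch outdated dependencies by hand, e.g. pin a version the old code still supports.
run("repo/.venv/bin/pip install 'numpy<2.0'")

# 4. Work out the right entry point and arguments, then execute.
run("cd repo && .venv/bin/python train.py --config configs/experiment.yaml")
```

Each of these steps can fail for reasons the repository's documentation never mentions, and that is exactly the effort SUPER asks agents to take over.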

Can LLMs Automate This Process?

Recent advances in large language models (LLMs) have automated a growing range of coding tasks, such as writing code, editing code, and even resolving GitHub issues. But can LLMs go beyond these well-scoped tasks and autonomously replicate research results from existing repositories?

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

To assess the capability of LLMs to perform these tasks, researchers at the Allen Institute for AI have developed a new benchmark called SUPER (Setting UP and Executing tasks from Research repositories). SUPER aims to capture the realistic challenges faced by researchers working with ML and NLP research repositories.

Three Distinct Problem Sets

The benchmark consists of three distinct problem sets:

  1. Expert Set: Contains 45 manually curated problems solved by experts. Each task involves reproducing a research result from a specific repository; a sketch of what such a task might look like follows this list.
  2. Masked Set: Includes 152 sub-problems derived from the Expert set by removing specific parts of the expert-written code. Each sub-problem targets a particular challenge, such as installing dependencies, configuring experimental data, setting up hyperparameters, resolving runtime exceptions, or correctly executing scripts.
  3. Auto Set: Contains 604 automatically generated problems, designed for larger-scale development and fine-tuning using environment feedback.
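To give a feel for an individual problem, the sketch below shows one plausible way a task could be represented. The field names and example values are assumptions for illustration; they are not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ReproductionTask:
    repo_url: str       # research repository the agent must set up and run
    instruction: str    # which result to reproduce and how to report it
    gold_answer: float  # expert-obtained value the agent's answer is checked against

# Hypothetical example; the repository and numbers are made up for illustration.
example_task = ReproductionTask(
    repo_url="https://github.com/example/nlp-paper-code",
    instruction="Train the model on the provided dev split and report dev accuracy.",
    gold_answer=0.847,
)
```

The Masked set can then be thought of as handing the agent a partially completed expert solution for such a task and asking it to fill in one specific missing step.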

Challenges Faced by LLMs

The results from the benchmark highlight the significant challenges that current LLM-based agents face on these tasks: even the strongest agents evaluated solve only a small fraction of the end-to-end Expert problems, and although they fare better on the narrower Masked sub-problems, their performance remains far from reliable. Failures frequently stem from difficulty reasoning over unfamiliar repository structures and from making the targeted code edits needed to get experiments running.
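What "solving" a task means here ultimately comes down to whether the agent's reported result matches the value the expert obtained. A deliberately simplified check of that kind is sketched below; the tolerance and function name are illustrative assumptions, not the benchmark's actual scoring code.

```python
def reproduced(agent_answer: float, gold_answer: float, tol: float = 0.01) -> bool:
    """Toy success check: is the agent's metric within a small tolerance of the gold value?"""
    return abs(agent_answer - gold_answer) <= tol

print(reproduced(0.85, 0.847))   # True: close enough to count as a reproduction
print(reproduced(0.50, 0.847))   # False: the result was not replicated
```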

SUPER: A Valuable Resource for the Research Community

SUPER serves as a valuable resource for the research community: the expert-verified Expert and Masked sets offer a realistic testbed for measuring progress on autonomous reproduction of research results, while the large Auto set supports agent development and fine-tuning at scale.

Conclusion

While LLMs have made impressive progress on code-related tasks, SUPER shows that autonomously replicating research results from existing repositories remains a significant challenge. The benchmark highlights the need for further advances in LLM-based agents, particularly in repository reasoning and code editing; progress in these areas will be crucial for building agents that can reliably reproduce research results.