Large Language Models Struggle with Multi-Step Code Generation Tasks, New Research Shows

A new benchmark reveals limitations in LLMs’ ability to solve complex problems requiring multiple code generation steps.

Large language models (LLMs) have made impressive strides in code generation, but a new paper posted on arXiv reveals a significant weakness: they often struggle when tasked with generating code that involves multiple steps of reasoning and self-invocation. Researchers Zhaojian Yu, Yilun Zhao, Arman Cohan, and Xiao-Ping Zhang introduce a new task, self-invoking code generation, along with benchmarks for evaluating LLMs on it. The task differs fundamentally from traditional benchmarks: a model must not only generate code for a single function but also use that generated code to solve a more complex, related problem.

Think of it like this: a traditional code generation benchmark might ask an LLM to write a function to add two numbers. A self-invoking task would go further, asking the LLM to first write a function to add two numbers, and then use that function as a building block to create a more complicated function that adds a list of numbers. This requires the model to demonstrate not only coding proficiency but also an understanding of how to use modular code effectively, mirroring real-world software development challenges.
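
To make that concrete, here is a minimal Python sketch of what such a paired problem could look like; the function names and the list-summing example are illustrative rather than taken from the benchmarks themselves.

```python
# Base problem: the kind of single-function task found in traditional benchmarks.
def add(a: int, b: int) -> int:
    """Return the sum of two numbers."""
    return a + b


# Self-invoking problem: a harder, related task that must reuse the base solution.
def add_list(numbers: list[int]) -> int:
    """Return the sum of a list of numbers by repeatedly calling add()."""
    total = 0
    for n in numbers:
        total = add(total, n)  # the previously generated function is invoked here
    return total


assert add_list([1, 2, 3, 4]) == 10
```

The second function is only considered correct if it both produces the right answer and builds on the first, which is what makes the task a test of modular reasoning rather than isolated code completion.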

The researchers created three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, each extending an existing benchmark to test self-invoking code generation. Their analysis of over twenty LLMs, spanning both open-source and proprietary models, revealed a consistent pattern: while models achieved high accuracy on traditional benchmarks (e.g., 96.2% pass@1 on HumanEval for the o1-mini model), their performance declined significantly on the self-invoking variants (dropping to 76.2% on HumanEval Pro for the same model).
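
For readers curious about the metric, below is a simplified sketch of how pass@1 could be computed for a benchmark like this, assuming one generated sample per problem and assert-based tests; it is a stand-in for, not a reproduction of, the paper's actual evaluation harness.

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execute a generated solution against its tests; any exception counts as failure.

    NOTE: exec() on untrusted model output is unsafe; real harnesses sandbox this step.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # defines the base and self-invoking functions
        exec(test_code, namespace)       # runs assert-based checks against them
        return True
    except Exception:
        return False


def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """With one sample per problem, pass@1 is the fraction of problems whose sample passes."""
    results = [passes_tests(code, tests) for code, tests in samples]
    return sum(results) / len(results)
```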

This drop in performance highlights that simply scaling up models or using instruction tuning isn’t enough to overcome this limitation. The instruction-tuned models showed only marginal improvements compared to their base models in self-invoking tasks. This suggests that future research needs to go beyond the simple instruction-tuning approach to tackle the challenge of multi-step code generation.

The paper also delves into the failure modes observed in the LLMs’ attempts at self-invoking code generation. Common errors included incorrect handling of variables, logic errors in the self-invoked code, and failure to correctly utilize the initially generated function. A detailed analysis of these failure modes, along with examples, helps to clarify the specific areas where current LLMs fall short.
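
As a constructed illustration of that last failure mode (not an example drawn from the paper), a model might solve the base problem correctly and then ignore it, re-deriving the logic with a bug:

```python
def add(a: int, b: int) -> int:
    return a + b


# Faulty self-invoking attempt: instead of reusing add(), the model re-implements
# the accumulation and mishandles the loop, so every list is summed incorrectly.
def add_list(numbers: list[int]) -> int:
    total = numbers[0]
    for n in numbers:          # bug: the first element is counted twice
        total = total + n
    return total


add_list([1, 2, 3])  # returns 7 instead of 6; the base function was never invoked
```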

Importantly, the researchers also explored the use of chain-of-thought (CoT) prompting, a technique that guides LLMs through problem-solving by explicitly outlining the reasoning steps. While CoT did improve performance, it didn’t fully bridge the gap between traditional code generation and self-invoking tasks, suggesting there’s more work to be done.
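
As a rough illustration, a chain-of-thought prompt for a self-invoking task might spell out the intermediate steps like this; the wording is hypothetical and may differ from the prompts used in the paper's experiments.

```python
# Illustrative CoT-style prompt for a self-invoking problem (wording is hypothetical).
prompt = """You are solving a two-part coding problem.

Step 1: Write a function `add(a, b)` that returns the sum of two numbers.
Step 2: Explain how `add` can be reused to solve the harder problem.
Step 3: Write a function `add_list(numbers)` that sums a list of numbers
        by calling `add`, describing your reasoning before the code.
"""
```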

The study underscores the need for benchmarks that more accurately reflect the complexities of real-world programming tasks. The self-invoking code generation paradigm presented in this paper offers a promising avenue for evaluating and improving the more advanced reasoning abilities LLMs need to genuinely master code. It's a crucial step towards LLMs becoming more effective and reliable tools for developers.