CodeARC: A New Challenge to Test LLMs' Reasoning Skills in Code Synthesis
A team of researchers from Stanford University, the University of Illinois Urbana-Champaign, MIT, Intel, and Visa Research has introduced a new benchmark called CodeARC (Code Abstraction and Reasoning Challenge) to rigorously evaluate the reasoning capabilities of Large Language Model (LLM) agents in inductive program synthesis: the task of writing code from input-output examples. The work addresses a gap in how we test LLMs, since existing evaluations often rely on natural language guidance and static example sets, and so fail to capture real-world settings such as reverse engineering.
Unlike existing benchmarks, CodeARC presents an interactive environment where LLM agents must infer the underlying logic of a hidden function by:
- Querying the function with new inputs to generate input-output pairs.
- Synthesizing candidate functions based on the observed inputs and outputs.
- Using a “differential testing oracle” – a special tool that identifies discrepancies between the synthesized function and the true, hidden function.
- Iteratively refining their solution based on the feedback from the oracle.
Think of it like trying to reverse-engineer a piece of software. You don’t have the source code, but you can interact with it, see how it responds to different inputs, and use that information to build your own equivalent version. The differential testing oracle helps the LLM identify edge cases and subtle bugs in its synthesized code, guiding it towards a more accurate solution.
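To make the interaction concrete, here is a minimal, self-contained sketch of that query-synthesize-test-refine loop. Everything in it is a toy stand-in: the real benchmark hides one of 1114 Python functions, candidates are produced by an LLM agent, and the differential testing oracle is provided by the evaluation harness. The hidden function here is the same rounding behavior used in the example later in this article.

```python
import math
import random

def hidden_function(x):
    # Ground truth the agent cannot inspect: round half away from zero.
    return int(x + math.copysign(0.5, x))

def differential_oracle(candidate, trials=1000):
    """Toy oracle: randomly search for an input where the candidate disagrees
    with the hidden function, and return it as a counterexample."""
    for _ in range(trials):
        x = round(random.uniform(-10, 10), 1)
        if candidate(x) != hidden_function(x):
            return x, hidden_function(x)
    return None  # no discrepancy found within the budget

def synthesize(observations):
    """Stand-in for the LLM: return the first hypothesis consistent with
    every observed (input, output) pair."""
    hypotheses = [
        lambda x: round(x),                        # banker's rounding
        lambda x: math.floor(x + 0.5),             # round half up
        lambda x: int(x + math.copysign(0.5, x)),  # round half away from zero
    ]
    for h in hypotheses:
        if all(h(i) == o for i, o in observations):
            return h
    return hypotheses[0]

# Interactive loop: query/observe, synthesize, test against the oracle, refine.
observations = [(1.5, 2), (2.4, 2), (-2.6, -3)]
for _ in range(10):
    candidate = synthesize(observations)
    counterexample = differential_oracle(candidate)
    if counterexample is None:
        print("Accepted a candidate consistent with", observations)
        break
    observations.append(counterexample)  # fold the counterexample back in
```

Note how the first two hypotheses fit the initial examples but are eliminated once the oracle surfaces half-value counterexamples; that feedback-driven refinement is the core of the benchmark's interactive setup.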
The CodeARC benchmark includes a large and diverse set of 1114 general-purpose Python functions, making it more challenging than existing benchmarks that focus on narrow domains or tasks. To illustrate the difficulty: even the best-performing model, OpenAI's o3-mini, achieved a success rate of only 52.7% on the benchmark.
The researchers also explored “fine-tuning” LLMs – further training them on specially created data to improve their performance. They found that fine-tuning LLaMA-3.1-8B-Instruct, a popular open-source LLM, on curated synthesis traces capturing both function calls and reasoning steps yielded up to a 31% relative performance improvement. This shows that targeted training can significantly boost LLMs’ ability to reason about code.
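The paper's exact trace format is not reproduced here, but a training record for this kind of fine-tuning plausibly interleaves the agent's reasoning with its tool calls and final answer, along the lines of the hypothetical schema below (all field names are illustrative, not taken from the CodeARC release):

```python
# Hypothetical synthesis-trace record; the schema is an assumption for
# illustration, not the actual CodeARC fine-tuning data format.
trace = {
    "initial_examples": [(1.5, 2), (2.4, 2), (-2.6, -3)],
    "steps": [
        {"thought": "Outputs look like rounding; check how halves behave.",
         "action": "query_function", "input": 2.5, "observation": 3},
        {"thought": "2.5 maps to 3, so halves round away from zero.",
         "action": "submit_candidate",
         "code": "def f(x):\n    import math\n    return int(x + math.copysign(0.5, x))",
         "oracle_feedback": "no counterexample found"},
    ],
}
```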
For example, an LLM might be given the following input-output pairs:
```
 1.5 -> 2
 2.4 -> 2
-2.6 -> -3
```
The LLM has to figure out that the hidden function is rounding numbers to the nearest integer, but with a twist: it rounds half-values away from zero.
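As a concrete sketch, one implementation consistent with these pairs and with half-away-from-zero behavior is shown below (the benchmark's actual reference function is not given in this summary). Note that Python's built-in round() would also fit the three pairs above but uses banker's rounding and so diverges on inputs like 2.5, which is exactly the kind of edge case the differential testing oracle is designed to surface.

```python
import math

def round_half_away_from_zero(x: float) -> int:
    # Adding +/-0.5 toward the sign of x and then truncating toward zero
    # rounds exact halves away from zero: 1.5 -> 2, 2.5 -> 3, -2.5 -> -3.
    return int(x + math.copysign(0.5, x))

assert round_half_away_from_zero(1.5) == 2
assert round_half_away_from_zero(2.4) == 2
assert round_half_away_from_zero(-2.6) == -3
assert round_half_away_from_zero(2.5) == 3   # built-in round(2.5) == 2
```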
CodeARC represents a significant step forward in how we evaluate LLMs, providing a more realistic and challenging testbed for assessing their inductive reasoning and program synthesis skills. It highlights the areas where LLMs still struggle and provides valuable insights into how we can improve their ability to reason about code.