New Benchmark Tackles LLM Math Reasoning Skills
A new benchmark, Putnam-AXIOM, has been introduced to more rigorously assess the advanced mathematical reasoning capabilities of large language models (LLMs). The benchmark draws its problems from the prestigious William Lowell Putnam Mathematical Competition, a notoriously challenging exam for undergraduate students.
The current landscape of LLM benchmarks is showing signs of saturation, with models achieving near-ceiling accuracy on existing tests. This has raised concerns about “data contamination”: models may be memorizing answers seen in their training data rather than truly understanding the underlying mathematical concepts.
To combat this, Putnam-AXIOM not only includes 522 original problems from the competition but also introduces “functional variants.” These variants are created by programmatically altering variables and constants within the original problems. This approach generates an almost unlimited stream of unique, yet equally difficult, problems that are unlikely to have been encountered during LLM training. For example, a problem might involve a specific number like “2011,” and its variant could change this to “4680,” requiring the LLM to re-apply its reasoning to a problem that is superficially different but structurally identical.
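To make the idea concrete, here is a minimal sketch of how such a variant generator might work. The problem template, constant range, and answer formula below are illustrative assumptions, not drawn from the Putnam-AXIOM codebase.

```python
import random

def make_variant(seed: int):
    """Generate a functional variant of a parameterized problem.

    Hypothetical example: the problem statement and its closed-form answer
    are both functions of a single constant n, so resampling n yields a new
    but structurally identical problem whose answer can be re-derived.
    """
    rng = random.Random(seed)
    n = rng.randrange(1000, 10000)  # swap the original constant (e.g. 2011)
    problem = (
        f"Let f(x) = x^2 - {n}x. "
        f"Find the value of x that minimizes f(x)."
    )
    answer = n / 2  # answer recomputed from the new constant
    return problem, answer

problem, answer = make_variant(seed=42)
print(problem)
print("Expected answer:", answer)
```

Because the answer is recomputed from the sampled constant rather than stored, the generator can emit fresh problems indefinitely while keeping the difficulty and solution structure fixed.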
The research team found that leading LLMs, including OpenAI’s o1-preview, scored around 41.9% on the original Putnam-AXIOM problems. However, accuracy dropped by nearly 20% when the same models were tested on the functional variants. This substantial and consistent decrease across multiple models suggests a reliance on memorization rather than genuine problem-solving.
Beyond just final answers, Putnam-AXIOM also introduces “Teacher-Forced Accuracy” (TFA) as a novel evaluation metric. TFA assesses how well an LLM predicts each step of a solution, given the correct prior steps. This provides a more granular view of the model’s reasoning process, moving beyond simply checking the final “boxed” answer. TFA has been shown to correlate well with final answer accuracy and is less computationally intensive than other methods that evaluate the entire reasoning trace.
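As a rough illustration, TFA can be computed per problem as the fraction of solution steps the model reproduces when conditioned on the ground-truth steps before each one. The step representation, the `model_next_step` callable, and the exact-string matching below are assumptions for this sketch; the paper’s actual step comparison may be softer.

```python
def teacher_forced_accuracy(gold_steps, model_next_step):
    """Sketch of Teacher-Forced Accuracy (TFA).

    Assumptions: a reference solution is a list of step strings, and
    `model_next_step` is a hypothetical callable that returns the model's
    prediction for the next step given the ground-truth steps so far.
    """
    if not gold_steps:
        return 0.0
    correct = 0
    for i, gold in enumerate(gold_steps):
        # Teacher forcing: condition on the correct prefix, then check
        # whether the model reproduces the reference step.
        prediction = model_next_step(gold_steps[:i])
        if prediction.strip() == gold.strip():
            correct += 1
    return correct / len(gold_steps)

# Toy usage with a stub "model" that happens to know the solution.
solution = ["Let x = 2.", "Then x^2 = 4.", "So the answer is 4."]
stub_model = lambda prefix: solution[len(prefix)]
print(teacher_forced_accuracy(solution, stub_model))  # prints 1.0
```

Scoring one step at a time against a fixed prefix is what keeps the metric cheap: each step is a short, independent prediction rather than a full free-running solution trace.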
The researchers believe that Putnam-AXIOM and its associated metrics will provide a more robust and contamination-resistant way to measure LLM progress in complex mathematical reasoning, guiding future development towards models that can truly understand and solve challenging problems. The dataset and code are publicly available for further research.