Active Learning Slashes Data Annotation Costs for Math Reasoning AI
A team from the National University of Singapore, Sea AI Lab, and Singapore Management University has developed a novel active learning approach called ACTPRM, significantly reducing the annotation costs associated with training Process Reward Models (PRMs) for large language models (LLMs) used in mathematical reasoning. Their findings, published in a recent paper, demonstrate substantial improvements in efficiency without sacrificing performance.
PRMs provide step-by-step feedback to LLMs as they tackle complex mathematical problems. Traditionally, training these models requires extensive and expensive annotation of each step in a solution, indicating whether it’s correct or incorrect. This annotation process becomes a bottleneck, especially when dealing with the vast datasets needed to train modern LLMs.
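To make the bottleneck concrete, a fully annotated PRM training example might look something like the sketch below. The schema is an illustrative assumption, not the paper's actual data format; the point is that every intermediate step, not just the final answer, needs its own label.

```python
# Hypothetical shape of a step-annotated math-reasoning example for PRM training.
# Every intermediate step carries its own correct/incorrect label, which is
# exactly what makes exhaustive annotation expensive at scale.
example = {
    "problem": "Solve 2x + 3 = 11.",
    "steps": [
        {"text": "Subtract 3 from both sides: 2x = 8", "label": 1},  # correct
        {"text": "Divide both sides by 2: x = 4",      "label": 1},  # correct
        {"text": "Check: 2 * 4 + 3 = 10",              "label": 0},  # arithmetic slip
    ],
}
```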
ACTPRM addresses this by selectively annotating only the most uncertain steps in a problem-solving process. It uses an ensemble of PRMs to estimate the uncertainty of each step. Imagine, for instance, an LLM solving a calculus problem: ACTPRM checks whether the ensemble agrees on the correctness of each step. If disagreement is high, the step is flagged for annotation; otherwise, it is skipped. A highly capable reasoning model (another LLM) then labels the flagged steps, and the PRM is updated on this new, targeted data.
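A minimal sketch of this selection loop, assuming stand-in interfaces for the PRM ensemble and the labeling model (none of the names below come from the paper's code), might look like:

```python
# Uncertainty-driven step selection in the spirit of ACTPRM (illustrative only).
# An ensemble of PRM scorers rates each step; only steps the ensemble disagrees
# on are sent to an expensive labeler, the rest are skipped.
from statistics import pstdev
from typing import Callable, Dict, List

Scorer = Callable[[str], float]   # maps a reasoning step to P(step is correct)
Labeler = Callable[[str], int]    # expensive oracle: 1 = correct, 0 = incorrect

def select_uncertain_steps(steps: List[str], ensemble: List[Scorer],
                           threshold: float = 0.15) -> List[int]:
    """Return indices of steps where the PRM ensemble disagrees the most."""
    flagged = []
    for i, step in enumerate(steps):
        scores = [score(step) for score in ensemble]
        if pstdev(scores) > threshold:   # disagreement proxies uncertainty
            flagged.append(i)
    return flagged

def annotate_actively(steps: List[str], ensemble: List[Scorer],
                      labeler: Labeler) -> Dict[int, int]:
    """Label only the uncertain steps; confident ones never reach the labeler."""
    return {i: labeler(steps[i]) for i in select_uncertain_steps(steps, ensemble)}

# Toy usage with mock scorers and a mock labeler:
steps = ["Differentiate x^2 to get 2x", "Set 2x = 0, so x = 5"]
ensemble = [
    lambda s: 0.30 if "x = 5" in s else 0.95,  # the mock PRMs agree on step 0
    lambda s: 0.80 if "x = 5" in s else 0.93,  # ...but disagree sharply on step 1
    lambda s: 0.50 if "x = 5" in s else 0.94,
]
labeler = lambda s: 0 if "x = 5" in s else 1
print(annotate_actively(steps, ensemble, labeler))  # {1: 0} -- only step 1 is labeled
```

In the full method, the newly labeled steps then feed a PRM update, and the loop repeats over fresh batches of reasoning trajectories.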
The researchers demonstrated that ACTPRM can reduce annotation costs by as much as 50% while maintaining comparable or even better performance than training on fully annotated datasets. In one experiment, ACTPRM outperformed previous state-of-the-art models on the ProcessBench dataset while using only 20% of their annotation budget.
To further illustrate the gains, consider the task of verifying a mathematical proof generated by an LLM. A traditional approach requires annotating every line of the proof to determine correctness. With ACTPRM, the system focuses only on steps where the model exhibits the most uncertainty, significantly reducing the annotation effort required. The paper’s authors report an average annotation cost of roughly 23.2×10^9 generated tokens for the baseline “UniversalPRM”, compared to ~4.8×10^9 tokens for ACTPRM.
The team further improved their actively trained PRM by applying it to a massive dataset of over one million math reasoning trajectories. By filtering out the most certain trajectories, they retained 60% of the data and then retrained the PRM. The resulting model achieved new state-of-the-art performance on both ProcessBench (75.0%) and PRMBench (65.5%) datasets.
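A rough sketch of that certainty-based filtering step, under assumed interfaces (the uncertainty score and retention fraction below are illustrative, not the authors' exact recipe):

```python
# Keep the trajectories the actively trained PRM is least certain about and
# discard the rest before retraining (illustrative sketch, not the paper's code).
from typing import Callable, List

def filter_by_uncertainty(trajectories: List[str],
                          uncertainty: Callable[[str], float],
                          keep_fraction: float = 0.6) -> List[str]:
    """Retain the `keep_fraction` most uncertain trajectories for retraining."""
    ranked = sorted(trajectories, key=uncertainty, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

# Toy usage with a stand-in uncertainty score:
pool = [f"trajectory_{i}" for i in range(10)]
mock_uncertainty = lambda t: (int(t.rsplit("_", 1)[1]) % 7) / 7.0
retained = filter_by_uncertainty(pool, mock_uncertainty)
print(f"kept {len(retained)} of {len(pool)} trajectories")  # kept 6 of 10
```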
The code and models are publicly available, supporting reproducibility and integration into other research.
This work opens the door for more efficient training of PRMs, making it more feasible to scale up LLMs for mathematical reasoning and potentially other complex tasks.