Divide and Conquer: A New Way to Teach AI Accuracy and Honesty
Large Language Models (LLMs) are increasingly capable of solving complex math problems, but they often suffer from a persistent personality flaw: they don’t know when they are wrong. When researchers try to train models to be both accurate and self-aware, they often run into a “tug-of-war” where teaching the model a new skill, like estimating its own confidence, accidentally degrades its ability to solve the original problem.
A new paper from researchers at Nebius and The Humanoid proposes a more surgical approach. Instead of grading an AI’s entire response with a single score, their method, called Blockwise Advantage Estimation (BAE), breaks the text into segments and rewards each part for its specific goal.
The Credit Assignment Problem
To understand the breakthrough, imagine a student taking a two-part exam. Part one is a difficult calculus problem; part two asks the student to rate their confidence in their answer from 0 to 100%.
In standard reinforcement learning—specifically a popular method called Group Relative Policy Optimization (GRPO)—the AI is graded with a single score covering the entire response. If the AI gets the math right but gives a nonsensical confidence score, it receives a mediocre grade and then struggles to figure out which part it messed up. Did it fail at the math or the self-reflection? This “misattributed credit” often leads to “reward hacking,” where the model finds weird shortcuts to earn a higher score without actually getting smarter.
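To see the credit-assignment problem concretely, here is a minimal sketch of GRPO-style advantages (the function and variable names are illustrative, not the paper’s code): every sampled response gets one blended score, and every token in that response inherits the same normalized advantage.

```python
import numpy as np

def grpo_advantages(rewards):
    """One blended scalar reward per sampled response, normalized within the group.

    Every token in a response inherits the same advantage, so the model cannot
    tell whether the math or the confidence report earned (or lost) the score.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 sampled responses, each graded with a single score that blends
# answer correctness and confidence quality.
print(grpo_advantages([1.0, 0.2, 0.9, 0.1]))
```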
Segmenting Success
The researchers’ core insight is that LLM tasks are naturally “segmented.” In their framework, the math solution is one block, and the confidence report is another. With BAE, training updates the “math” tokens based solely on whether the answer was correct, and the “confidence” tokens based solely on how well the stated confidence tracked the actual outcome.
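A minimal sketch of that separation, assuming each response splits cleanly into a solution block and a confidence block (the reward definitions below, including the Brier-style calibration score, are illustrative assumptions rather than the paper’s exact formulas):

```python
import numpy as np

def group_normalize(rewards):
    """Group-relative advantage: compare each sample to the rest of its group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 sampled responses, each split into two blocks.
correctness = np.array([1, 0, 1, 0])            # solution block: right answer or not
confidence  = np.array([0.9, 0.8, 0.6, 0.1])    # confidence block: stated probability
calibration = -(confidence - correctness) ** 2  # Brier-style score for the confidence block

adv_solution   = group_normalize(correctness)   # applied only to solution tokens
adv_confidence = group_normalize(calibration)   # applied only to confidence tokens
# Note: the confidence block here still uses a plain group-mean baseline;
# the outcome-conditioned baseline described below refines exactly this step.
```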
However, a technical hurdle remained. To judge if a confidence score is “good,” you need a baseline—a sense of how confident a model should be at that specific moment. Usually, calculating this baseline requires the computer to pause and simulate thousands of potential “alternate futures” from that exact midpoint, a process that is prohibitively expensive and slow.
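For contrast, here is roughly what that expensive route looks like: a brute-force Monte Carlo estimate of the baseline that generates many fresh continuations from the midpoint for every training sample (the rollout helper is a hypothetical placeholder).

```python
def midpoint_baseline(rollout, midpoint_state, num_rollouts=1000):
    """Estimate the expected reward at a midpoint by resampling continuations.

    The cost scales with num_rollouts per training sample, which is why this
    route is usually prohibitive.
    """
    total = 0.0
    for _ in range(num_rollouts):
        total += rollout(midpoint_state)  # hypothetical: generate a continuation, score it
    return total / num_rollouts
```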
The “Outcome-Conditioned” Shortcut
The team’s solution is the Outcome-Conditioned Baseline (OCB). Instead of simulating new data, it looks at the group of samples the model just generated. It stratifies these samples based on what happened in the first block.
Think of it like coaching a tennis player on their second serve. You wouldn’t compare the quality of their second serve to every serve ever made; you would only compare it to other second serves where the first serve was already a fault. By comparing “wrong-answer” confidence scores only against other “wrong-answer” scores, the model gains a much clearer signal of what honesty looks like.
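In code, the idea might look like the sketch below (an illustration under the assumptions above, not the paper’s implementation): the confidence block’s advantage is computed only within the stratum of samples whose solution block had the same outcome, so no extra rollouts are needed.

```python
import numpy as np

def ocb_advantages(confidence_rewards, outcomes):
    """Outcome-conditioned, group-relative advantages for the confidence block.

    Each sample is compared only against other samples in the group whose
    solution block reached the same outcome (correct vs. incorrect).
    """
    rewards = np.asarray(confidence_rewards, dtype=float)
    outcomes = np.asarray(outcomes)
    advantages = np.zeros_like(rewards)
    for outcome in np.unique(outcomes):
        mask = outcomes == outcome
        stratum = rewards[mask]
        advantages[mask] = (stratum - stratum.mean()) / (stratum.std() + 1e-8)
    return advantages

# Reusing the Brier-style confidence rewards from the earlier sketch:
print(ocb_advantages([-0.01, -0.64, -0.16, -0.01], [1, 0, 1, 0]))
```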
Better Calibration, Less Compute
The results are striking. On math benchmarks like MATH500 and GSM8K, the BAE method matched the performance of state-of-the-art systems that require “hand-designed” reward formulas, which are notoriously difficult to tune.
More importantly, the models became significantly better “calibrated.” When the model said it was 90% sure, it was actually right about 90% of the time. This honesty is vital for “test-time scaling”—a technique where a model generates multiple answers and uses its own confidence score to pick the best one.
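As a rough picture of how that selection works (the generate helper is a hypothetical placeholder that returns an answer together with its self-reported confidence):

```python
def pick_by_confidence(generate, prompt, n=8):
    """Generate n candidate answers and keep the one with the highest
    self-reported confidence. This only helps if that confidence is well calibrated."""
    candidates = [generate(prompt) for _ in range(n)]  # each item: (answer, confidence)
    best_answer, best_confidence = max(candidates, key=lambda pair: pair[1])
    return best_answer, best_confidence
```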
By moving away from “one-size-fits-all” rewards, this research provides a modular recipe for training AI to handle multi-step reasoning, self-reflection, and complex agentic tasks without the massive computational overhead of traditional methods.