AI Teaching Itself: New Framework Helps Multimodal Models Break Free from Human Supervision
In the world of artificial intelligence, a significant “data bottleneck” has long frustrated researchers. To teach Multimodal Large Language Models (MLLMs)—the AIs that can see and reason about images—to solve complex math or logic problems, humans usually have to painstakingly label thousands of examples or use even larger, more expensive “teacher” models to grade them.
However, a team of researchers from institutions including OPPO AI Center and Tsinghua University has unveiled a new training framework that allows AI to improve its own reasoning without any human-annotated answers or external help. Described in a recent paper, the method, called “Unsupervised Self-Evolution,” lets models effectively “fact-check” their own logic, leading to large performance gains on complex benchmarks.
The Problem with “Majority Rule”
Until now, the most common way to let an AI teach itself was through a technique called “majority voting.” If you ask a model a difficult geometry question ten times, and it gives the same answer eight times, the system assumes that answer is correct and trains itself to favor it.
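The majority-vote signal is simple to state in code. The sketch below is illustrative, not from the paper: it just counts the sampled answers and treats the most frequent one as a pseudo-label, along with a “confidence” that is really only a measure of consistency.

```python
from collections import Counter

def majority_vote_label(answers):
    """Pick the most frequent answer among sampled responses as a
    pseudo-label -- the "majority rule" self-training signal."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Eight samples for the same question. The model is consistent, but
# consistency alone says nothing about correctness.
samples = ["42", "42", "41", "42", "42", "43", "42", "42"]
label, confidence = majority_vote_label(samples)
# label == "42", confidence == 0.75
```

Note that nothing in this procedure checks whether “42” is actually right, which is exactly the weakness the researchers set out to fix.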
The problem, the researchers point out, is that AI models can be confidently wrong. If a model has a systematic bias, majority voting simply reinforces its mistakes, leading to a “collapse” where the model becomes more certain of its errors.
For example, imagine asking an AI to calculate an angle in a rhombus. A standard model might repeatedly use the wrong formula—perhaps incorrectly assuming the diagonals are equal—and reach the same wrong answer multiple times. Under old self-training methods, the model would “learn” this incorrect shortcut simply because it was consistent.
The “Actor” and the “Judge”
The new framework solves this by splitting the AI’s personality into two roles: an Actor and a Judge.
When presented with a visual puzzle—like finding the area of a square embedded in a larger quadrilateral—the Actor generates several different “trajectories,” or step-by-step reasoning paths. Instead of just looking at the final answer, a “Judge” (a frozen version of the model itself) evaluates the quality of each reasoning step.
The Judge looks for “visual grounding”—checking if the model is actually looking at the right part of the image—and “reasoning quality.” Even if the Actor only finds the correct answer in one out of eight attempts, the Judge can recognize that the logic in that single attempt was superior. It then rewards that specific path, reshaping the model’s “inner monologue” to prefer high-quality logic over simple consensus.
Grading on a Curve
To make the training stable, the researchers used a technique called Group Relative Policy Optimization (GRPO). Rather than trying to reach an absolute “perfect” score (which is hard to define without a human answer key), the model compares its various attempts against each other.
It’s essentially “grading on a curve.” By looking at a group of its own responses to the same image, the model identifies which ones are the “relative best.” This allows the AI to gradually shift its probability toward better reasoning without ever seeing a ground-truth answer.
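The “grading on a curve” step has a standard form in GRPO: each response’s reward is normalized against the mean and standard deviation of its own group, so only relative quality matters. A minimal sketch (simplified; the full algorithm also clips policy updates, which is omitted here):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward against the group
    of responses to the same prompt, (r - mean) / std. No absolute
    "correct" score is needed, only within-group comparison."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Judge scores for four attempts at the same image question:
rewards = [0.2, 0.4, 0.4, 0.8]
advantages = group_relative_advantages(rewards)
# Above-average attempts get positive advantage; below-average, negative.
```

Responses with positive advantage are pushed up in probability and the rest are pushed down, which is how the model drifts toward its relative best without a ground-truth answer key.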
Results and the Path Ahead
The results were striking. On MathVision, a notoriously difficult benchmark for visual mathematical reasoning, the framework boosted the model’s accuracy from 25% to 30.9%—a nearly 6-point jump achieved entirely without human labels.
As high-quality human data becomes increasingly scarce and expensive, this “self-evolving” approach offers a scalable path forward. It suggests that the next generation of AI models might not need a human teacher to show them the way; they might just need enough “quiet time” to think through their own mistakes.