Supervised Training on Proof Failures Transforms AI Theorem Provers
In a significant advance for automated mathematical reasoning, researchers have demonstrated that training large language models (LLMs) specifically on failed proofs and compiler diagnostics dramatically improves their ability to fix errors, positioning smaller models to outperform much larger state-of-the-art systems.
The new work addresses a critical bottleneck in neural theorem proving: while AI models have achieved near-Olympiad levels of performance in formal verification systems like Lean 4, existing training datasets consist overwhelmingly of successful proofs. This leaves models ill-equipped to act like human proof engineers, who constantly interpret compilation failures and incrementally refine their code.
To solve this, a research team introduced APRIL (Automated Proof Repair in Lean), a massive dataset of 260,000 supervised training tuples. Each tuple combines an erroneous Lean proof, the exact compiler diagnostics (error messages and local proof state), an aligned corrected proof, and a natural-language explanation of the error and suggested fix.
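As a rough illustration of that record structure, a single training tuple might be serialized along the lines of the sketch below; the field names are assumptions for illustration, not the dataset's published schema.

```python
# Hedged sketch of one APRIL-style training tuple; field names are
# illustrative, not the dataset's actual schema.
april_record = {
    "erroneous_proof": "<Lean 4 proof that fails to compile>",
    "diagnostics": "<compiler error messages plus the local proof state at the failure>",
    "corrected_proof": "<aligned Lean 4 proof that compiles>",
    "explanation": "<natural-language description of the error and the suggested fix>",
}
```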
Crucially, APRIL was built not by mining human errors, which are scarce, but by systematically introducing controlled mutations into verified proofs. This process ensures the generated failures are realistic and structurally aligned with the intended proof sketch.
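The paper describes this generation process at a high level; the following is a minimal sketch of the controlled-mutation idea, assuming a tactic-swap mutation and a Lean 4 project compiled via `lake env lean`. All function names, the swap table, and the invocation are assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of controlled mutation: break a verified proof, keep it only
# if the compiler rejects it, and pair it with the original as the repair target.
import subprocess
import tempfile
from pathlib import Path

# Plausible-but-weaker substitute for an effective tactic (illustrative only).
TACTIC_SWAPS = {"nlinarith": "ring"}

def mutate_tactic(proof: str) -> str:
    """Swap one effective tactic for a plausible but inadequate one."""
    for strong, weak in TACTIC_SWAPS.items():
        if strong in proof:
            return proof.replace(strong, weak, 1)
    return proof

def lean_diagnostics(source: str) -> tuple[int, str]:
    """Compile a Lean file in the project environment and capture its diagnostics."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = Path(f.name)
    result = subprocess.run(
        ["lake", "env", "lean", str(path)], capture_output=True, text=True
    )
    return result.returncode, result.stdout + result.stderr

def make_repair_example(verified_proof: str) -> dict | None:
    """Pair a verified proof with a mutated, failing variant and its diagnostics."""
    broken = mutate_tactic(verified_proof)
    if broken == verified_proof:
        return None  # no applicable mutation
    code, diagnostics = lean_diagnostics(broken)
    if code == 0:
        return None  # the mutation still compiles, so it is not a useful failure
    return {
        "erroneous_proof": broken,
        "diagnostics": diagnostics,
        "corrected_proof": verified_proof,  # the original proof is the aligned repair target
        # A natural-language explanation would be generated separately (e.g., by an LLM).
    }
```

Because the corrected proof is simply the original verified proof, every generated failure is guaranteed to be structurally aligned with a known-good repair.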
The errors fall into four main types, designed to stress different aspects of semantic understanding:
- Theorem Mutation Errors: Replacing a correct theorem (e.g., a rule requiring a positive integer) with a semantically similar but type-incompatible theorem. This forces the model to reason precisely about hypotheses and type structure rather than just correcting syntax.
- Tactic Mutation Errors: Swapping an effective tactic like nlinarith (a non-linear arithmetic solver) with a plausible but inadequate substitute, such as ring (see the Lean sketch after this list). Fixing this requires the model to understand why the current goal state demands a more powerful technique.
- Line/Multi-line Mutation Errors: Redacting a line or block of code and prompting an LLM to hallucinate a plausible but incorrect completion, simulating the “near-miss” failures common in real development.
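To make the tactic-mutation case concrete, here is a small Lean 4 sketch (assuming Mathlib is available). The theorem and comments are illustrative, not drawn from the dataset, and the second declaration intentionally fails to compile.

```lean
import Mathlib

-- Original verified proof: a simple inequality closed by nonlinear arithmetic.
theorem sq_add_one_ge (a : ℝ) : a ^ 2 + 1 ≥ 2 * a := by
  nlinarith [sq_nonneg (a - 1)]

-- Tactic-mutated variant: `ring` only proves ring equalities, so it cannot
-- close this inequality goal and Lean reports an error at this line.
-- The repair is to restore a sufficiently powerful tactic such as `nlinarith`.
theorem sq_add_one_ge' (a : ℝ) : a ^ 2 + 1 ≥ 2 * a := by
  ring
```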
The proof-repair results were striking. A 4-billion-parameter LLM (Qwen3-4B-Instruct) finetuned on APRIL saw its repair accuracy jump from a baseline of 1.1% to 27.4% in a single-shot repair evaluation (meaning no iterative search or retries were allowed).
This finetuned 4B model narrowly outperformed the most powerful open-source prover baseline, Goedel-Prover-V2-32B, which achieved 26.8% under the same zero-search protocol.
This suggests that targeted, error-centric supervised training is a strong yet underutilized signal that can partially offset the advantages of model scale. While end-to-end theorem proving typically relies on vast scale and extensive search, APRIL shows that training models to interpret and act on compiler feedback builds “agentic” capabilities—the ability to diagnose failures and execute targeted repairs—and makes iterative proof refinement substantially more efficient. The researchers argue that this approach is vital for building future AI provers that can integrate seamlessly into human-like interactive development environments.