AI Papers Reader

Personalized digests of latest AI research

Self-Playing AI Learns Advanced Math by Co-Evolving as Teacher and Student

EDINBURGH, UK – Researchers have unveiled a new method called Open-Ended Self-Improving Reasoner (OpenSIR) that allows large language models (LLMs) to autonomously generate and master increasingly complex mathematical problems without relying on human-annotated training data. The self-play framework marks a significant step toward achieving truly open-ended artificial intelligence capable of continual discovery.

Traditionally, improving LLM reasoning via reinforcement learning requires massive, human-labeled datasets to provide verifiable reward signals. This reliance limits scalability and confines AI performance to human-level benchmarks. OpenSIR bypasses this bottleneck by assigning a single LLM policy two alternating roles: a Teacher that generates novel problems, and a Student that attempts to solve them.

The core of OpenSIR’s innovation lies in how the Teacher optimizes for problem novelty, which is defined along two critical dimensions: difficulty and diversity.

First, to maintain optimal challenge, the Teacher generates problems that are difficult but still solvable. Solvability is measured by the consistency of the Student’s solutions across multiple attempts. If a problem is too easy (high solve rate) or too hard or malformed (low solve rate), the reward decreases. Second, a diversity reward favors problems that explore new mathematical concepts, ensuring the model continuously broadens its skill set rather than repeating familiar ideas.
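To make the two reward dimensions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the formulas used by OpenSIR: the function names, the 0.3 to 0.7 target band for the solve rate, the bag-of-words similarity measure, and the equal weighting of the two terms are all hypothetical choices made for this example.

```python
from collections import Counter
import math


def solve_rate(answers):
    """Fraction of Student attempts that agree with the most common answer."""
    if not answers:
        return 0.0
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)


def solvability_reward(rate, low=0.3, high=0.7):
    """1.0 inside the assumed band [low, high]; falls linearly to 0.0 at rates 0 and 1."""
    if rate < low:
        return rate / low
    if rate > high:
        return (1.0 - rate) / (1.0 - high)
    return 1.0


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def diversity_reward(problem, pool):
    """1 minus the highest similarity to any earlier problem (assumed measure)."""
    if not pool:
        return 1.0
    words = Counter(problem.lower().split())
    return 1.0 - max(_cosine(words, Counter(p.lower().split())) for p in pool)


def teacher_reward(problem, answers, pool, weight=0.5):
    """Assumed combination: weighted sum of the solvability and diversity terms."""
    rate = solve_rate(answers)
    return weight * solvability_reward(rate) + (1.0 - weight) * diversity_reward(problem, pool)
```

With these assumptions, a problem that every attempt answers identically earns no solvability reward, and a problem that repeats an earlier one word for word earns no diversity reward.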

Starting from a single trivial prompt, “What is 1+1?”, OpenSIR successfully bootstraps its own curriculum.
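The bootstrapping loop can be pictured with a toy simulation. This is only a sketch of the control flow (one policy alternating between Teacher and Student, a problem pool seeded with 1+1), not the actual training procedure: `ToyPolicy`, its methods, the integer-sum problems, and the 0.3 to 0.9 keep band are hypothetical stand-ins, and the `update` method is a placeholder for the reinforcement learning step.

```python
import random
from collections import Counter


class ToyPolicy:
    """Hypothetical stand-in for a single LLM that plays both roles."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.skill = 1  # largest operand the Student currently adds reliably

    def propose(self, pool):
        # Teacher role: write a sum that may reach slightly past the hardest one so far.
        hardest = max(a + b for a, b in pool)
        return (self.rng.randint(1, hardest + 1), self.rng.randint(1, hardest + 1))

    def solve(self, problem):
        # Student role: exact within current skill, noisy beyond it.
        a, b = problem
        if max(a, b) <= self.skill:
            return a + b
        return a + b + self.rng.choice([-1, 0, 1])

    def update(self, problem):
        # Placeholder for the RL update: learning from a kept problem raises skill.
        self.skill = max(self.skill, max(problem))


def self_play(rounds=20, attempts=8, seed=0):
    policy = ToyPolicy(seed)
    pool = [(1, 1)]  # the seed problem, "What is 1+1?"
    for _ in range(rounds):
        problem = policy.propose(pool)
        answers = [policy.solve(problem) for _ in range(attempts)]
        rate = Counter(answers).most_common(1)[0][1] / attempts
        # Keep problems that are neither trivially easy nor hopelessly inconsistent.
        if 0.3 <= rate <= 0.9 and problem not in pool:
            pool.append(problem)
            policy.update(problem)
    return pool
```

Calling `self_play(rounds=30)` returns the pool of kept problems, which always begins with the seed `(1, 1)`; the same seed reproduces the same pool.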

The reported results show consistent self-improvement across models. For instance, Llama-3.2-3B-Instruct’s average accuracy across five math benchmarks (including GSM8K and College Math) improved by 3.6 points, and Gemma-2-2B-Instruct saw a larger gain of 5.9 points. Notably, OpenSIR outperformed GRPO reinforcement learning baselines trained on thousands of human-annotated examples, suggesting that self-play without annotated data can be more effective than training on curated data.

Qualitative analysis confirms OpenSIR’s ability to create an adaptive curriculum. The model starts with basic arithmetic but quickly progresses to generating challenging questions in domains like calculus, optimization, and trigonometry-based physics. For example, the Teacher autonomously moved from simple sums to complex optimization problems, such as calculating the maximum number of differently sized containers that fit in a warehouse under specific logistical constraints.

Crucially, the study found that the co-evolution of the Teacher and Student roles is essential. If the Teacher’s abilities are fixed, it cannot adapt the difficulty of its generated problems to the Student’s improving skills, leading to inconsistent challenges and significantly degraded learning performance.

By dynamically calibrating difficulty and actively promoting the exploration of varied concepts, OpenSIR provides a new paradigm for mathematical reasoning development, enabling LLMs to expand their capabilities autonomously beyond the boundaries set by human data.