
SPC: AI Model Evolves Its Reasoning Skills Through Adversarial Games

A new study introduces a Self-Play Critic (SPC), a novel approach designed to enhance the reasoning capabilities of large language models (LLMs) such as those used in advanced AI assistants. SPC employs a unique adversarial game where two copies of an LLM are fine-tuned to play opposing roles: a “sneaky generator” that introduces subtle errors, and a “critic” that identifies these errors.

The core idea is to have the two models compete in a game-like setting. The sneaky generator tries to inject into a problem-solving process incorrect reasoning steps that are difficult for the critic to detect. For example, in a math problem involving base conversions, the generator might introduce a subtle calculation error while converting a binary number to base 4, changing “1121_4” to “11121_4”. The critic, in turn, analyzes each step and judges whether it is correct.
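To make the base-conversion example concrete, here is a small Python sketch of how such a step and its corruption might look. The binary input `1011001_2` is an assumption introduced for illustration (the digest only quotes the correct and corrupted base-4 strings); it is simply one number whose correct base-4 form is 1121_4.

```python
# Hypothetical illustration of the kind of subtle error the sneaky generator injects.
# Assumption: the binary input is 1011001_2, whose correct base-4 form is 1121_4;
# the article only gives the correct/corrupted base-4 strings, not the input number.

def binary_to_base4(bits: str) -> str:
    """Convert a binary string to base 4 by grouping bits in pairs from the right."""
    bits = bits.zfill((len(bits) + 1) // 2 * 2)              # pad to an even length
    pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    return "".join(str(int(p, 2)) for p in pairs)

correct_step = binary_to_base4("1011001")   # "1121"  -> the valid reasoning step
sneaky_step = "1" + correct_step            # "11121" -> the subtly corrupted step
print(correct_step, sneaky_step)
```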

Through reinforcement learning, the models iteratively improve. If the critic successfully identifies the error, it receives a positive reward, and the sneaky generator gets a negative one. Conversely, if the generator fools the critic, the rewards are reversed. This feedback loop drives both models to evolve their abilities—the generator becoming better at creating deceptive errors, and the critic becoming more adept at detecting them.
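The paragraph above amounts to a simple zero-sum reward scheme. The sketch below shows one way such per-round rewards could be assigned; the ±1 values and the function name are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the zero-sum reward assignment described above. The +/-1 values
# are assumptions for illustration; the paper's exact reward shaping may differ.

def round_rewards(critic_caught_error: bool) -> tuple[float, float]:
    """Return (critic_reward, generator_reward) for one sneaky-vs-critic round."""
    if critic_caught_error:
        return +1.0, -1.0   # critic detects the injected error
    return -1.0, +1.0       # the generator's error slips past the critic

# These scalar rewards would then drive the reinforcement-learning updates
# (e.g., policy-gradient steps) on the two fine-tuned copies of the LLM.
```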

Researchers tested SPC on several reasoning benchmarks, including ProcessBench, PRM800K, and DeltaBench. Results showed that SPC significantly enhances error detection. For example, on ProcessBench, accuracy in identifying incorrect reasoning steps increased from 70.8% to 77.7%. Furthermore, SPC outperformed existing methods, including a “distilled R1 model,” a strong baseline for LLM reasoning.

The utility of SPC extends beyond error identification. The paper also shows how the trained critic can guide the test-time search of LLM solvers. During problem-solving, the critic predicts the correctness of each step; if a step is judged incorrect, the solver discards it and generates a replacement. Experiments on the MATH500 and AIME2024 datasets showed that this approach improves the mathematical reasoning performance of various LLMs, including Llama, Qwen, and the distilled R1 model. For instance, pairing SPC with a Qwen solver raised problem-solving accuracy on AIME2024 to 23.3%, outperforming a self-consistency baseline.
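The loop below sketches how such critic-guided generation could work. `solver` and `critic` are hypothetical wrappers around the LLM and the trained SPC critic, and the retry limit and fall-back behaviour are assumptions; the paper's exact search procedure may differ.

```python
# Rough sketch of critic-guided step-by-step solving. `solver` and `critic` are
# hypothetical interfaces; the retry limit and fall-back choice are assumptions.

def solve_with_critic(problem: str, solver, critic,
                      max_steps: int = 20, max_retries: int = 3) -> list[str]:
    steps: list[str] = []
    while len(steps) < max_steps and not solver.is_finished(problem, steps):
        for _ in range(max_retries):
            candidate = solver.next_step(problem, steps)           # propose one reasoning step
            if critic.step_is_correct(problem, steps, candidate):  # critic approves the step
                steps.append(candidate)
                break
        else:
            steps.append(candidate)  # all retries rejected: keep the last attempt and move on
    return steps
```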

This research highlights the potential of adversarial self-play as a training method for improving the reliability and accuracy of LLM reasoning, with significant implications for complex problem-solving tasks. The system provides a way to continuously generate training data without requiring extensive human annotations.