AI Papers Reader

Personalized digests of latest AI research


Satori: A Self-Improving Language Model That Masters Reasoning Through Autoregressive Search

A team of researchers from MIT, Singapore University of Technology and Design, Harvard, MIT-IBM Watson AI Lab, IBM Research, and UMass Amherst has developed a new 7-billion-parameter large language model (LLM) called Satori. The model advances the state of the art in LLM reasoning, particularly in mathematical problem-solving, and generalizes well to out-of-domain tasks. Their findings, published as a preprint on arXiv, describe a two-stage training paradigm that uses reinforcement learning to teach Satori to perform autoregressive search, internalizing the search process within a single LLM.

Traditional approaches to enhancing LLM reasoning often involve either extensive human annotation of training data or employing a two-player system where one LLM generates solutions and another verifies them. Both methods are costly and inefficient. Satori’s innovation lies in its ability to internally refine its reasoning process through a combination of format tuning and reinforcement learning.

The first stage of Satori’s training involves format tuning. The researchers fine-tuned a pre-trained LLM on a relatively small dataset (10,000 examples) of carefully constructed reasoning trajectories. These trajectories are generated by a multi-agent system, which includes a generator LLM, a critic LLM, and a reward model LLM. This initial stage familiarizes Satori with a new reasoning format called Chain-of-Action-Thought (COAT). COAT incorporates special meta-action tokens such as <|continue|>, <|reflect|>, and <|explore|>, prompting the model to continue reasoning, pause for self-reflection, or explore alternative solution strategies, respectively. For example, if Satori makes a mistake in a calculation, a <|reflect|> token would trigger a self-correction step.
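To make the COAT format more concrete, here is a minimal Python sketch that splits a COAT-style string into (meta-action, step) pairs. Only the three token strings come from the description above; the function name `split_coat`, the `<|start|>` label for text before the first token, and the example trajectory are assumptions made for illustration, not the authors' code or data.

```python
import re
from typing import List, Tuple

# Meta-action tokens named in the article; everything else is illustrative.
META_ACTIONS = ("<|continue|>", "<|reflect|>", "<|explore|>")
_TOKEN_RE = re.compile("(" + "|".join(re.escape(t) for t in META_ACTIONS) + ")")


def split_coat(trajectory: str) -> List[Tuple[str, str]]:
    """Split a COAT-style string into (meta_action, step_text) pairs.

    Text before the first meta-action token is labeled "<|start|>",
    a hypothetical label used only in this sketch.
    """
    # With a capturing group, re.split keeps the matched tokens:
    # [text0, token1, text1, token2, text2, ...]
    parts = _TOKEN_RE.split(trajectory)
    steps: List[Tuple[str, str]] = []
    head = parts[0].strip()
    if head:
        steps.append(("<|start|>", head))
    for token, text in zip(parts[1::2], parts[2::2]):
        steps.append((token, text.strip()))
    return steps


# A made-up trajectory: a wrong calculation, a reflection, an alternative route.
example = (
    "Compute 12 * 13. "
    "<|continue|> 12 * 13 = 146. "
    "<|reflect|> Check: 12 * 10 + 12 * 3 = 156, so 146 was wrong. "
    "<|explore|> Alternatively, 13 * 6 * 2 = 156."
)
steps = split_coat(example)
for action, text in steps:
    print(action, "->", text)
```

Reading a trajectory this way shows how each token marks a decision point: the model either extends the current line of reasoning, checks it, or switches strategy.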

The second stage uses reinforcement learning (RL) with a technique called “Restart and Explore” (RAE) to significantly improve Satori’s reasoning abilities. The RL training builds on the COAT format and uses a reward model to guide learning. The reward considers not only whether Satori’s final answer is correct but also whether the model explores and corrects itself. RAE further improves learning efficiency by letting Satori restart its reasoning from intermediate steps within a trajectory, so it can focus on correcting errors rather than starting from scratch. This is particularly important for tasks with sparse rewards, such as mathematical problem-solving, where the correctness of the final answer would otherwise be the only reward signal.
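The restart idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: `restart_states`, `toy_reward`, the bonus value of 0.5, and the example problem are all hypothetical, and the real system samples from a trained policy and scores rollouts with its own reward design.

```python
import random
from typing import List

REFLECT = "<|reflect|>"  # meta-action token named in the article


def restart_states(problem: str, steps: List[str]) -> List[str]:
    """Return one restart prompt per point of a past trajectory.

    The prompt for index k holds the problem plus the first k steps:
    k = 0 is the original problem, and k = len(steps) keeps the whole
    earlier attempt so that a reflection can follow a wrong final step.
    """
    return [" ".join([problem] + steps[:k]) for k in range(len(steps) + 1)]


def toy_reward(final_answer: str, reference: str,
               restarted_from_error: bool, bonus: float = 0.5) -> float:
    """Outcome reward plus a hypothetical bonus for fixing an earlier mistake."""
    outcome = 1.0 if final_answer.strip() == reference.strip() else 0.0
    if outcome == 1.0 and restarted_from_error:
        return outcome + bonus
    return outcome


problem = "Compute 12 * 13."
# A made-up earlier attempt whose last step is wrong (the sum should be 156).
past_steps = [
    "12 * 13 = 12 * 10 + 12 * 3.",
    "12 * 10 = 120 and 12 * 3 = 36.",
    "So the answer is 146.",
]

states = restart_states(problem, past_steps)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
start = rng.choice(states) + " " + REFLECT  # resume mid-trajectory, then reflect

print(start)
print(toy_reward("156", "156", restarted_from_error=True))
```

Because a rollout can begin from a partially written solution, a correct final answer after a restart gives the learner a signal about the correction itself, which is harder to obtain when every episode starts from the bare problem.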

The results show that Satori achieves state-of-the-art performance on several mathematical reasoning benchmarks, including GSM8K, MATH500, and OlympiadBench, significantly outperforming models trained using traditional supervised fine-tuning methods. More importantly, Satori exhibits strong generalization abilities, excelling in out-of-domain tasks such as logical reasoning, code reasoning, and commonsense reasoning.

The researchers also conducted ablation studies, demonstrating the importance of both the COAT mechanism and the RL training with RAE. The use of COAT resulted in substantial performance improvements over traditional Chain-of-Thought reasoning. Similarly, RL training significantly enhanced Satori’s ability to self-correct errors and improve accuracy, showcasing the effectiveness of this self-improvement approach.

Satori represents a significant leap forward in LLM reasoning. By internalizing the autoregressive search process, it overcomes the limitations of previous methods, achieving both high performance and efficient use of resources. The open-sourcing of the code, data, and models associated with Satori should accelerate future research in this critical area of artificial intelligence.