AI Papers Reader

Personalized digests of latest AI research

Large Language Models Are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Large language models (LLMs) are impressive, but they still struggle with complex tasks that require multi-step reasoning. Prompting techniques such as Chain-of-Thought (CoT) can help at inference time, but improving reasoning during training remains challenging. This paper introduces LaTent Reasoning Optimization (LaTRO), a framework that treats the reasoning process as sampling from a latent distribution and optimizes it so the LLM simultaneously improves both its reasoning process and its ability to evaluate reasoning quality, without external feedback or reward models.
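In other words, the chain of reasoning acts as a latent variable sitting between the question and the answer, and the model's own likelihood of the correct answer plays the role of a reward. A rough sketch of this view, in our own notation rather than the paper's exact objective:

```latex
% Rough sketch of the latent-variable view (illustrative notation, not the
% paper's exact objective). Here x is the question, y the gold answer, z a
% sampled rationale, and \theta the LLM's parameters.
\[
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)} \big[ p_\theta(y \mid x, z) \big]
  \;\ge\; \mathbb{E}_{z \sim p_\theta(\cdot \mid x)} \big[ \log p_\theta(y \mid x, z) \big]
  \quad \text{(Jensen's inequality)}
\]
% The term inside the expectation doubles as the "self-reward" for a rationale:
\[
r_\theta(z) = \log p_\theta(y \mid x, z)
\]
```

Maximizing this lower bound encourages the model to generate rationales under which it assigns high probability to the correct answer.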

Imagine an LLM trying to solve the following:

“A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts does it take?”

LaTRO treats these reasoning steps as latent variables and optimizes the model so that its reasoning maximizes the likelihood of the correct answer. For example, the LLM might sample two reasoning paths:

- Reasoning 1: "Half of 2 bolts is 1 bolt of white fiber, so the robe takes 2 + 1 = 3 bolts." (correct)
- Reasoning 2: "The robe takes 2 bolts of blue fiber and 2 bolts of white fiber, so 4 bolts in total." (incorrect)

LaTRO then optimizes the model to favor Reasoning 1, since it leads to the correct answer. Because the reward is the model's own likelihood of the correct answer, LaTRO needs no external feedback or reward model: the LLM learns from its own sampled reasoning, including its mistakes, and improves over time.
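As a rough illustration of how such a self-rewarding update could look, here is a minimal Python sketch. The model interface (`sample_rationale`, `answer_logprob`, `reinforce_rationale`) is hypothetical and simply stands in for a real LLM; the paper's actual method uses a variational objective with gradient-based updates rather than this simplified loop.

```python
# Minimal, illustrative sketch of one LaTRO-style self-rewarding update.
# The ReasoningLM interface below is a hypothetical stand-in for a real LLM,
# not the paper's code; the point is that the reward comes from the model's
# OWN log-likelihood of the gold answer given a sampled rationale.

from typing import List, Protocol


class ReasoningLM(Protocol):
    def sample_rationale(self, question: str) -> str: ...
    def answer_logprob(self, question: str, rationale: str, answer: str) -> float: ...
    def reinforce_rationale(self, question: str, rationale: str, weight: float) -> None: ...


def latro_style_step(model: ReasoningLM, question: str, gold_answer: str,
                     num_samples: int = 4, lr: float = 1e-5) -> None:
    """One self-rewarding update on a single (question, gold_answer) pair."""
    # 1. Sample several candidate reasoning paths (latent variables z) from the model.
    rationales: List[str] = [model.sample_rationale(question) for _ in range(num_samples)]

    # 2. Self-reward: the model's own log-likelihood of the correct answer given each rationale.
    rewards = [model.answer_logprob(question, z, gold_answer) for z in rationales]

    # 3. Center the rewards so above-average rationales are reinforced and
    #    below-average ones are suppressed (a simple baseline, REINFORCE-style).
    baseline = sum(rewards) / len(rewards)

    # 4. Upweight rationales that make the model itself more confident in the gold answer.
    for z, r in zip(rationales, rewards):
        model.reinforce_rationale(question, z, weight=lr * (r - baseline))
```

The key point is that the reward signal comes entirely from the model itself: rationales after which the model assigns higher probability to the gold answer are reinforced, and the rest are suppressed.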

The paper validates LaTRO with experiments on the GSM8K and ARC-Challenge datasets. Across multiple model architectures, LaTRO consistently improves zero-shot accuracy, by an average of 12.5% over base models and 9.6% over supervised fine-tuning. This suggests that pre-trained LLMs already possess latent reasoning capabilities that self-rewarding training can unlock and enhance.

These findings suggest that reasoning ability can be improved without external feedback, reward models, or human-annotated rationales, by surfacing capabilities that are already latent in pre-trained models.

The paper concludes that LaTRO is a significant step toward systems that can self-evolve their problem-solving capabilities. It opens new avenues for improving the reasoning abilities of LLMs, and the prospect of models that improve their own reasoning without external supervision is a promising direction for the field.