Large Language Models Are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
Large language models (LLMs) are impressive, but they struggle with complex reasoning tasks that require multiple steps. Prompt-based methods like Chain-of-Thought (CoT) help at inference time, but optimizing the reasoning process itself during training remains challenging. This paper introduces LaTent Reasoning Optimization (LaTRO), a framework that treats reasoning as sampling from a latent distribution: the LLM simultaneously improves its reasoning process and its ability to evaluate reasoning quality, acting as its own reward model.
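In rough terms (a paraphrased sketch, not the paper's exact notation), the idea is to treat the rationale z as a latent variable sampled from the model itself and to maximize the model's own likelihood of the gold answer given that rationale:

```latex
% Simplified sketch of the latent-rationale objective (paraphrased notation):
% x = question, y = gold answer, z = sampled rationale, \pi_\theta = the LLM.
\[
  \max_{\theta}\;
  \mathbb{E}_{(x,y)\sim\mathcal{D}}\,
  \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
  \bigl[\, \log \pi_\theta(y \mid x, z) \,\bigr]
\]
% Here \log \pi_\theta(y \mid x, z) plays the role of the self-reward.
```

Because the reward is the model's own log-likelihood of the correct answer, the same network serves as both the reasoning policy and the reward model, and the objective can be optimized from sampled rationales alone.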
Imagine an LLM trying to solve the following:
“A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts does it take?”
LaTRO treats the reasoning steps as latent variables and optimizes them to maximize the likelihood of the correct answer. For this question, the model might sample two reasoning paths:
- Reasoning 1 (correct): “It takes 2/2=1 bolt of white fiber. 2+1=3. So, it takes a total of 3 bolts of fiber.”
- Reasoning 2 (incorrect): “We need 2 bolts of blue and 2 bolts of white fiber. In total, it is 2+2=4.”
LaTRO would then optimize the model to favor Reasoning 1, since it leads to the correct answer. Because LaTRO requires no external feedback, the model can learn from its own sampled mistakes and improve its reasoning abilities over time.
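To make the self-rewarding step concrete, here is a minimal Python sketch (my own illustration with a placeholder model and prompt format, not the paper's released code). It scores each candidate rationale by the model's own log-likelihood of the correct answer, which is the quantity a LaTRO-style update would reinforce:

```python
# Minimal sketch of self-rewarded rationale scoring (illustrative only).
# Model name, prompt format, and answer string are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper applies LaTRO to much larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = ("A robe takes 2 bolts of blue fiber and half that much white fiber. "
            "How many bolts does it take?")
answer = " The answer is 3."
rationales = [
    " It takes 2/2=1 bolt of white fiber. 2+1=3. So, it takes a total of 3 bolts of fiber.",
    " We need 2 bolts of blue and 2 bolts of white fiber. In total, it is 2+2=4.",
]

def answer_log_prob(question: str, rationale: str, answer: str) -> torch.Tensor:
    """Self-reward: log pi_theta(answer | question, rationale), summed over answer tokens."""
    prefix_ids = tok(question + rationale, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # The logit at position i predicts the token at position i + 1.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    positions = range(prefix_ids.shape[1] - 1, input_ids.shape[1] - 1)
    targets = input_ids[0, prefix_ids.shape[1]:]
    return sum(log_probs[0, pos, tgt] for pos, tgt in zip(positions, targets))

rewards = [answer_log_prob(question, r, answer) for r in rationales]
for r, rew in zip(rationales, rewards):
    print(f"self-reward = {rew.item():7.2f} | {r.strip()[:55]}...")
```

During training, rationales with higher self-reward (Reasoning 1 here) would be reinforced with a policy-gradient-style update on the rationale tokens, so the model gradually prefers reasoning paths that make the correct answer more likely.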
The paper validates LaTRO with experiments on the GSM8K and ARC-Challenge datasets. Across multiple model architectures, LaTRO improves zero-shot accuracy by up to 12.5% over base models and 9.6% over supervised fine-tuning. This suggests that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced.
These findings have several implications:
- Pre-trained LLMs are capable reasoners: LaTRO demonstrates that LLMs can act as their own reward models for evaluating reasoning paths, so we don’t need to rely solely on external feedback to improve their reasoning abilities.
- Improving reasoning without external feedback: LaTRO enables LLMs to self-improve their reasoning capabilities without relying on task-specific few-shot examples or external reward models, which is a significant advantage.
- Shifting computation from inference to training: LaTRO can compress reasoning processes and shift computational burdens from inference time to training time, leading to faster and more efficient reasoning at inference.
The paper concludes that LaTRO represents a significant step towards creating more intelligent systems that can self-evolve their problem-solving capabilities. The work opens up new avenues for improving the reasoning abilities of LLMs, and its potential for self-improvement is a promising development in the field of artificial intelligence.