Meta Chain-of-Thought: Teaching LLMs to Think Like Humans
Large language models (LLMs) have made impressive strides in various tasks, but their reasoning capabilities often fall short when tackling complex problems. A new paper, “Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought,” proposes a novel framework called Meta Chain-of-Thought (Meta-CoT) to address this limitation. This framework aims to mimic the human thought process, going beyond simple chain-of-thought (CoT) prompting by explicitly modeling the underlying reasoning steps.
The core idea behind Meta-CoT is that complex reasoning isn’t a simple linear process. Humans often engage in an iterative process of exploration and verification, formulating hypotheses, testing them, and adapting their strategies as needed. This involves a “latent thinking” process not fully captured by traditional CoT, which simply guides the LLM to provide step-by-step reasoning towards a solution. Meta-CoT aims to explicitly model this latent process.
The paper supports this idea with empirical evidence from state-of-the-art LLMs like OpenAI’s o1 series, showing behaviors consistent with in-context search. For example, when faced with complex mathematical problems, the o1 models generate significantly more tokens than other models, suggesting that they are actively exploring and evaluating different solution paths rather than simply applying a pre-learned algorithm. Figure 1 in the paper visually demonstrates this, showing that OpenAI’s o1 models significantly outperform other models on the HARP mathematics benchmark, particularly in higher difficulty levels, and generate more tokens as the problem complexity increases.
The authors propose several ways to train models to generate Meta-CoTs, including:
-
Process Supervision: This involves training a separate model (a Process Reward Model or PRM) to evaluate the quality of intermediate reasoning steps. This allows the main LLM to backtrack from less promising paths and explore more promising ones. Figure 2 illustrates the effect of increasing the training data on Llama3.1 8B when evaluated using this method. The performance improvement is substantial, and the improvement does not plateau, suggesting the potential for even better performance with more data.
-
Meta Reinforcement Learning: This approach treats the reasoning process as a Markov Decision Process (MDP) and trains the LLM using reinforcement learning to find optimal reasoning strategies. The objective function encourages exploring different paths efficiently, focusing on finding the correct answer, as seen in Figure 18.
The paper further discusses the critical role of search algorithms in Meta-CoT. The authors investigate algorithms such as Monte Carlo Tree Search (MCTS) and A*, demonstrating how they can generate synthetic training data to teach LLMs this more complex type of reasoning. Figure 5 showcases the efficiency gains of tree search methods over other sampling methods on the game of 24.
However, the paper also highlights open research questions:
-
Scaling laws: How do the performance of LLMs scale with model size, training data, inference time compute, and the complexity of the reasoning task? Figure 6 shows the significant performance gain that is achieved with more compute and larger models in board games.
-
Open-ended verification: How can we reliably evaluate the quality of the reasoning process for problems that don’t have easily verifiable answers?
-
Emergence of super-intelligence: Can Meta-CoT lead to the emergence of truly novel reasoning capabilities beyond those observed in current LLMs?
In conclusion, the paper presents a compelling argument for Meta-CoT as a framework for enabling more human-like reasoning in LLMs. The proposed training pipeline offers a promising roadmap for future research, but significant challenges remain in scaling these algorithms, obtaining sufficient training data, and evaluating progress on problems without readily available correct solutions. The research emphasizes the need for large scale data and infrastructure to move beyond the limited scale of current experiments. The “Big MATH” project, which aims to compile a large, high quality dataset of mathematical problems, is an important step toward realizing this goal.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.