
Small Language Models Can Reason Too: Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Large language models (LLMs) like GPT-4 have shown impressive capabilities on complex reasoning tasks, including solving mathematical word problems. However, fine-tuning such models requires large, high-quality datasets, which are expensive and difficult to acquire. Recent research has therefore explored knowledge distillation, in which smaller models are trained to mimic the reasoning abilities of larger, more powerful LLMs. While this approach has yielded promising results, it can be computationally expensive and relies on access to proprietary models.

A new paper, "Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning," proposes a more efficient and scalable approach to enhancing the reasoning abilities of small-scale language models. Instead of relying on knowledge distillation from LLMs, the method leverages self-training and incorporates a technique called Direct Preference Optimization (DPO).

Self-training is a process in which a language model learns from its own outputs: the model generates reasoning steps for a given problem and then uses these steps as training data to improve its performance. DPO further augments this process with preference signals, training the model to prefer reasoning steps that lead to correct answers over those that do not. Such preference labels are cheap to obtain for mathematical reasoning tasks, since the final answers are typically unambiguous.
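
To make the DPO step concrete, the sketch below shows the standard DPO objective in PyTorch as it might be applied to pairs of rationales. It is an illustrative reconstruction, not the authors' code: the function name, tensor shapes, and the beta value are assumptions.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over pairs of rationales for the same question.

    Each *_logps tensor holds the summed token log-probability of a full
    chain-of-thought rationale under either the policy being trained or a
    frozen reference model. 'Chosen' rationales reach the correct final
    answer; 'rejected' ones do not.
    """
    # How much more (or less) likely each rationale is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Widen the margin between chosen and rejected rationales; beta scales
    # how far the policy may drift from the reference.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()


# Toy usage with a batch of two preference pairs (illustrative numbers).
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -20.1]),
    policy_rejected_logps=torch.tensor([-11.8, -22.4]),
    ref_chosen_logps=torch.tensor([-13.0, -21.0]),
    ref_rejected_logps=torch.tensor([-11.5, -21.9]),
)
```

The beta coefficient controls how far the trained policy may drift from the frozen reference model while it widens the gap between preferred and dispreferred rationales.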

The paper's main contribution is a DPO-augmented self-training framework that improves chain-of-thought reasoning in small models without distilling from larger, proprietary LLMs.

The authors evaluate their approach on the GSM8K benchmark, a dataset of roughly 8,500 grade-school math word problems, and compare it against supervised fine-tuning (SFT) and a plain self-training baseline. The proposed DPO-augmented self-training method consistently outperforms both baselines, achieving higher accuracy on both in-distribution and out-of-distribution tasks. It also reaches similar performance with significantly fewer training samples than the other approaches, highlighting its efficiency.
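
Because GSM8K answers are single numbers, the accuracy reported above and the preference labels used for DPO-style training can both come from the same correctness check on a rationale's final answer. The snippet below is a minimal sketch of that check, assuming a simple last-number heuristic; the helper names and the regular expression are illustrative, not the paper's exact protocol.

```python
import re


def extract_final_answer(rationale: str) -> str | None:
    """Return the last number mentioned in a generated rationale.

    GSM8K reference answers are single numbers, so the last number in the
    model's chain of thought is a common proxy for its final answer.
    """
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", rationale)
    return numbers[-1].replace(",", "") if numbers else None


def label_rationales(rationales: list[str], gold_answer: str):
    """Split sampled rationales into correct ('chosen') and incorrect ('rejected').

    The same correctness check yields both evaluation accuracy and the
    preference pairs needed for DPO-style training.
    """
    chosen, rejected = [], []
    for r in rationales:
        (chosen if extract_final_answer(r) == gold_answer else rejected).append(r)
    return chosen, rejected


# Toy usage: two sampled rationales for a question whose answer is 42.
chosen, rejected = label_rationales(
    ["6 boxes of 7 apples is 6 * 7 = 42. The answer is 42.",
     "6 + 7 = 13. The answer is 13."],
    gold_answer="42",
)
```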

This paper presents a valuable contribution to the field of language model reasoning. It introduces a novel and efficient method for enhancing the reasoning abilities of small-scale models, offering a more scalable and cost-effective approach to training models for complex reasoning tasks. This work is particularly significant for applications where access to large, proprietary LLMs is limited. The authors suggest that future research could explore integrating knowledge distillation into their iterative training process, potentially leading to even better performance.