AI Papers Reader

Personalized digests of latest AI research


ThinkPO: A Simple Post-SFT Method to Continuously Boost Large Language Model Reasoning Abilities

Large language models (LLMs) are increasingly used for complex problem-solving, but improving their reasoning capabilities remains a significant challenge. Supervised fine-tuning (SFT), in which models are trained on high-quality reasoning examples distilled from larger LLMs, has emerged as an effective way to boost reasoning performance. However, continuing to improve these SFT-ed models is difficult: acquiring new high-quality data is costly and time-consuming, and repeated training often leads to performance plateaus or even declines.

A new paper introduces a simple yet effective post-SFT method called Thinking Preference Optimization (ThinkPO) to address these limitations. Instead of relying on costly new data, ThinkPO takes readily available short chain-of-thought (CoT) reasoning responses as “rejected” answers and existing long CoT responses as “chosen” answers for the same questions. It then applies Direct Preference Optimization (DPO) to encourage the model to favor the longer, more detailed reasoning outputs.
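
To make the data construction concrete, here is a minimal sketch of how such preference pairs could be assembled. The function name and the `short_cot_answers` / `long_cot_answers` lookups are illustrative placeholders, not the paper's actual pipeline.

```python
# Sketch: pairing short and long chain-of-thought answers into DPO-style
# preference records. Names and field keys are illustrative only.

def build_thinkpo_pairs(questions, short_cot_answers, long_cot_answers):
    """For each question, treat the long CoT answer as 'chosen' and the
    short CoT answer as 'rejected'."""
    pairs = []
    for q in questions:
        if q in short_cot_answers and q in long_cot_answers:
            pairs.append({
                "prompt": q,
                "chosen": long_cot_answers[q],     # detailed, step-by-step reasoning
                "rejected": short_cot_answers[q],  # brief, less detailed reasoning
            })
    return pairs
```

Each resulting record is a (prompt, chosen, rejected) triple, the standard input format for DPO-style preference training.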

ThinkPO is illustrated with a typical problem-solving scenario: a mathematical question that requires detailed reasoning. Consider a base LLM that has a basic grasp of math but struggles with complex problems. SFT significantly improves this model’s accuracy on the MATH500 dataset (a standard benchmark) by training it on many high-quality examples of detailed mathematical reasoning. Even after SFT, however, the model tends to give short answers with limited step-by-step reasoning, and it is precisely those step-by-step explanations that often lead to more accurate solutions.
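
For readers unfamiliar with the SFT stage itself, the sketch below shows a single fine-tuning step, assuming a Hugging Face-style causal LM that returns a loss when given labels. It is a simplification for illustration, not the authors' training code.

```python
def sft_step(model, tokenizer, question, long_cot_answer, optimizer):
    """One supervised fine-tuning step: maximize the likelihood of a
    detailed reasoning trace given the question (next-token prediction).
    Assumes `model` is a Hugging Face causal LM (e.g. AutoModelForCausalLM),
    which returns a cross-entropy loss when `labels` are provided."""
    text = question + "\n" + long_cot_answer
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    # Loss over the whole sequence; practical setups usually mask the prompt
    # tokens so only the reasoning trace contributes to the loss.
    loss = model(input_ids=ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```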

This is where ThinkPO comes in. ThinkPO post-processes the SFT-ed model: for the same questions, it pairs readily available short, less detailed answers (rejected) with the longer, step-by-step answers already present in the SFT data (chosen). By training on these pairs and optimizing the model to prefer the longer, more detailed reasoning, ThinkPO pushes the LLM toward more thorough responses. The results show that this simple technique further improves reasoning: ThinkPO lifted one SFT-ed model’s MATH500 accuracy from 82.8% to 83.4%, improved math reasoning accuracy by 8.6% overall, and increased average response length by 25.9%.
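
The preference step rests on the standard DPO objective. The snippet below computes that loss in PyTorch, assuming per-sequence log-probabilities for the policy and a frozen reference model have already been gathered; it illustrates the mechanism rather than reproducing the paper's implementation, and the default `beta` is a common choice, not the paper's setting.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss on a batch of preference pairs: push the policy to
    prefer the chosen (long CoT) response over the rejected (short CoT) one,
    relative to a frozen reference model. Inputs are per-sequence summed
    log-probabilities (1-D tensors)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the reward margin; minimized when the policy
    # assigns relatively higher probability to the chosen response.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```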

The effectiveness of ThinkPO is further validated across a range of mathematical reasoning benchmarks (e.g., AIME, GPQA, Olympiad) and across model sizes (3B, 7B, and 14B parameters). In all cases, ThinkPO consistently improves on the SFT-ed models’ performance. Importantly, it can also continue to boost publicly available, distilled SFT models such as DeepSeek-R1-Distill-Qwen-7B, raising its MATH500 accuracy from 87.4% to 91.2% without any additional data.

ThinkPO’s simplicity and effectiveness lie in its ability to repurpose existing data, overcoming a key limitation of continued SFT and reducing the need for extensive new data collection. This significantly cuts the time and resources required to improve LLMs’ reasoning abilities. The researchers have released their code and dataset publicly, making the technique easy to reproduce and adapt for further research on LLM reasoning. While ThinkPO delivers notable reasoning gains, hyperparameter tuning remains crucial for optimal results.
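
As a rough illustration of the knobs such a DPO-style post-SFT run typically exposes, the settings below are common starting points for this kind of training, not the values reported in the paper.

```python
# Illustrative hyperparameters for a DPO-style post-SFT run.
# These are common starting points, NOT the paper's reported settings.
thinkpo_config = {
    "beta": 0.1,            # strength of the preference (KL) regularization
    "learning_rate": 5e-7,  # DPO runs typically use small learning rates
    "num_epochs": 1,
    "max_length": 4096,     # long CoT responses need a generous context window
}
```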