New Method Boosts Reasoning in Small AI Models with Smart Data Generation
Researchers at Zhongxing Telecom Equipment (ZTE) have developed a novel method for improving the reasoning abilities of smaller, more efficient AI models, known as Large Language Models (LLMs). The core innovation lies in how training data is created, specifically Chain-of-Thought (CoT) data. The research, detailed in a pre-print paper titled “Rethinking the Generation of High-Quality CoT Data from the Perspective of LLM-Adaptive Question Difficulty Grading,” focuses on generating training examples tailored to the specific strengths and weaknesses of these smaller models.
LLMs, especially the massive ones like DeepSeek-R1 with its 671 billion parameters, have shown impressive reasoning capabilities. However, these giants are computationally expensive, making them impractical for many real-world applications, especially on edge devices. The goal is to distill their knowledge into smaller models (with fewer than 70 billion parameters) that can perform complex reasoning tasks, such as solving math problems or generating code, with acceptable accuracy.
The key idea of the ZTE researchers’ method is to grade the difficulty of questions adaptively based on the smaller LLM’s own reasoning capabilities. Instead of relying on static difficulty labels or human intuition, they use the LLM itself to assess the questions. This adaptive approach is crucial because what might be challenging for one LLM could be trivial for another.
Here’s how the process works:
-
Adaptive Difficulty Evaluation: The researchers present a base LLM with a diverse set of questions. Questions answered correctly are deemed “easy”. For questions answered incorrectly, a “PRM-Grader” (Process Reward Model) analyzes the LLM’s reasoning steps and assigns a difficulty level (from 1 to 5, with 5 being most difficult). This step creates an LLM-Adaptive question database.
- Example: Imagine an LLM is tasked with solving algebraic equations. If it consistently solves linear equations but struggles with quadratic equations, the former will be classified as “easy”, and the latter will be assigned a higher difficulty level.
-
Comprehensive Adaptive Problem Library: Based on these difficulty grades, they build a comprehensive library. Questions of varying difficulty are sampled based on a distribution reflecting curriculum learning principles, meaning that there is a higher proportion of easier questions than difficult ones. This is to mimic human learning where concepts are built upon a solid foundation.
-
High-Quality CoT Data Generation: The questions are fed to a very powerful LLM (DeepSeek-R1) to generate “Chain-of-Thought” reasoning examples – step-by-step solutions to the problems. These CoT examples act as “teachers,” guiding the smaller LLM during training. Only the questions whose CoT answers could be validated as correct are used for fine-tuning.
- Example: The DeepSeek-R1 would solve the quadratic equation and explain its steps. The steps can be explained as: “To solve the equation x^2+5x+6=0…”.
The researchers demonstrated the effectiveness of their method on mathematical problem-solving (using datasets like AIME24) and code generation tasks (using LiveCodeBench). Remarkably, a 32-billion parameter model (ZMath-32B) trained with only 2,000 high-quality CoT examples generated using this method surpassed the performance of DeepSeek-Distill-32B (a model distilled from DeepSeek) in mathematical reasoning. Similar results were seen in code generation with the ZCode-32B model.
The study’s main contributions are demonstrating a method to improve the reasoning skills of smaller LLMs while saving on training cost. Further, the results show how essential it is to tailor training to each LLM. The researchers are planning to integrate reinforcement learning, and continue the research into smaller LLMs.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.