New AI Model Scales Reasoning Ability, Cuts Thinking Tokens by 70%
Large language models (LLMs) are increasingly relied upon for complex problem-solving, but their extensive reasoning processes can be computationally expensive. Researchers at Tsinghua University and Yale University have introduced a new approach, dubbed “Z1,” that significantly reduces the computational cost of test-time scaling, in which a model spends additional reasoning tokens at inference time to improve its answers.
The key innovation lies in training LLMs on code-related reasoning trajectories while incorporating a “Shifted Thinking Window.” This window allows the model to adapt its reasoning level to the complexity of the problem, avoiding unnecessary “overthinking.”
“Essentially, we’re teaching the model to be more efficient with its thought process,” explains Zhaojian Yu, the lead author of the paper. “It’s like knowing when to grab a calculator for a complex equation versus solving a simple one in your head.”
The researchers created a new dataset, Z1-Code-Reasoning-107K, comprising 107,000 code-related problems paired with both short and long solution trajectories. These trajectories represent the step-by-step reasoning processes required to solve the problems.
To combat overthinking, the researchers introduced the Shifted Thinking Window. This technique removes traditional delimiters (like <think>...</think>) and dynamically caps reasoning tokens, allowing the model to directly output concise answers for simple problems. For more complex tasks, if the model exceeds a token threshold, a “hint phrase” encourages it to generate a final answer based on its existing thought process.
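The control flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function name, the `generate` callback and its return shape, the default answer budget, and the wording of the hint phrase are all assumptions made for the example, since the article does not give them.

```python
from typing import Callable, List, Tuple

# Hypothetical hint phrase; the article does not quote the actual wording.
HINT_PHRASE = "\nBased on the reasoning so far, the final answer is:\n"

# Assumed interface: generate(text, budget) returns up to `budget` tokens and
# a flag saying whether the model stopped on its own before hitting the budget.
GenerateFn = Callable[[str, int], Tuple[List[str], bool]]


def shifted_thinking_window(
    prompt: str,
    generate: GenerateFn,
    max_thinking_tokens: int,
    max_answer_tokens: int = 256,  # assumed default, not from the article
) -> str:
    """Cap free-form reasoning, then nudge the model toward a final answer."""
    # No <think>...</think> delimiters: the model simply starts responding.
    tokens, finished = generate(prompt, max_thinking_tokens)
    partial = "".join(tokens)
    if finished:
        # Simple problem: the model answered within the cap.
        return partial
    # Cap reached: append the hint phrase so the model concludes from
    # the reasoning it has already produced.
    answer, _ = generate(prompt + partial + HINT_PHRASE, max_answer_tokens)
    return partial + HINT_PHRASE + "".join(answer)
```

In this sketch a simple problem costs only the tokens of its direct answer, while a hard problem costs at most the thinking cap plus the answer budget.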
The resulting model, Z1-7B, demonstrates impressive test-time scaling. In one experiment, it matched the performance of R1-Distill-Qwen-7B, another powerful LLM, while using only 30% of its average thinking tokens. “This means Z1 can achieve the same level of accuracy with significantly less computational resources,” Yu emphasizes.
For example, imagine two models given the prompt, “Write a Python script to calculate the number of letter ‘a’ and ‘r’ in a string”. One model might needlessly generate more than 1,700 tokens of reasoning steps before producing the script, while Z1 can proceed directly to generating the script, saving those tokens.
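To see why extended reasoning is unnecessary here, note how small a direct solution is. The following is a minimal sketch rather than output from either model; the function name is hypothetical, and it assumes case-sensitive counting, which the prompt does not specify.

```python
def count_a_and_r(text: str) -> dict:
    """Count occurrences of the lowercase letters 'a' and 'r' in text."""
    # str.count returns the number of non-overlapping occurrences.
    return {"a": text.count("a"), "r": text.count("r")}


print(count_a_and_r("strawberry"))  # {'a': 1, 'r': 3}
```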
Even more surprising, Z1-7B, fine-tuned solely on code trajectories, showed strong generalization capabilities on broader reasoning tasks, achieving 47.5% accuracy on GPQA Diamond, a notoriously difficult science question answering benchmark. This suggests that training on code-related reasoning can provide a foundation for more general reasoning skills.
Further analysis revealed that the length of the reasoning trajectories used during training plays a crucial role in test-time scaling efficiency. Models trained on datasets with longer average trajectories exhibited better performance.
The researchers are making their model, data, and code open-source, hoping to foster further research into efficient reasoning elicitation for LLMs. With increasing demands on AI compute, techniques like the Shifted Thinking Window offer a promising pathway toward more sustainable and accessible AI systems.