New LLM Reasoning Framework Reduces Costs Without Sacrificing Accuracy
Large language models (LLMs) are transforming numerous fields, but their computational costs remain a significant hurdle. A new paper, “Token-Budget-Aware LLM Reasoning,” introduces a novel framework, TALE, that significantly reduces the token usage of LLMs during reasoning tasks, while maintaining high accuracy. The key innovation lies in dynamically estimating and enforcing a token budget within the prompt, effectively compressing the reasoning process.
Chain-of-thought (CoT) prompting has substantially improved LLM reasoning abilities. CoT guides the LLM to break down complex problems into smaller, manageable steps. However, this detailed step-by-step reasoning drastically increases the number of tokens generated, leading to higher computational costs and energy consumption. For example, answering a simple question about scheduling after-work activities might take about 15 output tokens without CoT, while using CoT can balloon the token count to 258. This disparity highlights the issue addressed by the TALE framework. A rough way to observe the overhead yourself is sketched below.
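The following sketch (not from the paper) shows one way to measure that overhead with an OpenAI-compatible client. The model name, question, and prompt wording are illustrative assumptions, not the authors' setup.

```python
# Hypothetical sketch: compare output-token counts with and without CoT prompting.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
question = "Can I fit a 30-minute gym session and grocery shopping between 6pm and 8pm?"

def completion_tokens(prompt: str, model: str = "gpt-4o-mini") -> int:
    """Return how many tokens the model generates for a given prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.usage.completion_tokens

direct = completion_tokens(question)                               # short, direct answer
cot = completion_tokens(question + "\nLet's think step by step:")  # verbose reasoning
print(f"direct: {direct} tokens, CoT: {cot} tokens")
```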
The core idea behind TALE is to leverage the inherent ability of LLMs to constrain their responses when given explicit instructions. The researchers observed that including a token budget in the prompt (e.g., “Let’s think step-by-step and use less than 50 tokens:”) significantly reduces the output token count. However, setting the budget too low triggers a “token elasticity” phenomenon: the LLM overshoots the constraint, and the token cost climbs back up rather than shrinking further.
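A minimal sketch of that observation, again with a placeholder model name and illustrative budget values, asks the same question under progressively tighter budgets and records how many tokens are actually generated; very small budgets tend to show the elasticity effect.

```python
# Hypothetical sketch: probe how output length responds to a budget stated in the prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
question = "Can I fit a 30-minute gym session and grocery shopping between 6pm and 8pm?"

for budget in (400, 200, 100, 50, 10):  # illustrative budgets
    prompt = f"{question}\nLet's think step by step and use less than {budget} tokens:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    used = response.usage.completion_tokens
    print(f"budget={budget:4d} -> generated {used} tokens")
```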
To address this challenge, TALE incorporates a two-stage process. First, it estimates an appropriate token budget for a given question using a budget estimator. The researchers explore several approaches for budget estimation, including a zero-shot method that prompts the LLM itself to estimate the required token count and a regression-based method trained on data pairs of questions and their optimal token budgets (found through a binary search). Second, TALE incorporates this estimated budget into the CoT prompt, guiding the LLM to produce a concise and accurate response within the allocated token limit.
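To make the two-stage flow concrete, here is a minimal sketch of a zero-shot variant under stated assumptions: the estimation prompt, the parsing of the estimate, and the model name are placeholders, not the authors' exact implementation.

```python
# Hypothetical two-stage sketch: (1) zero-shot budget estimation, (2) budget-aware CoT prompt.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4o-mini"  # placeholder model name

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def estimate_budget(question: str, default: int = 100) -> int:
    """Stage 1: ask the LLM itself how many output tokens the question should need."""
    reply = ask(
        "Estimate how many output tokens are needed to answer the following question "
        f"with step-by-step reasoning. Reply with a single integer.\n{question}"
    )
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else default  # fall back if no number is returned

def budgeted_cot(question: str) -> str:
    """Stage 2: fold the estimated budget into the chain-of-thought prompt."""
    budget = estimate_budget(question)
    return ask(f"{question}\nLet's think step by step and use less than {budget} tokens:")

print(budgeted_cot(
    "Natalia sold clips to 48 friends in April and half as many in May. "
    "How many clips did she sell in total?"
))
```

The regression-based estimator described in the paper would replace `estimate_budget` with a trained predictor, using (question, optimal-budget) pairs found via binary search as training data.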
The authors rigorously evaluated TALE using three state-of-the-art LLMs—GPT-4o, GPT-4o-mini, and Yi-lightning—across several challenging mathematical reasoning datasets. The results demonstrate that TALE achieves, on average, a remarkable 68.64% reduction in token usage, while maintaining accuracy levels within 5% of the standard CoT approach. For example, on the GSM8K-Zero dataset, TALE reduced token costs from 252.96 to 22.67 tokens, a 91% reduction, while keeping accuracy extremely high (98.72%).
This research has significant implications for making LLMs more cost-effective and accessible. The findings suggest that the reasoning process of current LLMs is often unnecessarily verbose. TALE provides a practical solution for balancing efficiency and accuracy, paving the way for wider adoption of LLMs in resource-constrained environments. The code for TALE is publicly available, enabling others to build upon and extend this work. Future research may focus on improving the accuracy of the budget estimation techniques and exploring the framework's applicability to a broader range of tasks and LLMs.