AI Models Learn to Stop "Thinking Lazy" to Master Complex Tasks

New Training Framework Enables Smaller Language Models to Outperform Larger Competition in Tool Use

A team of researchers has unveiled D-CORE (Decomposing tasks and Composing Reasoning processes), a new training framework designed to combat a critical flaw in modern Large Reasoning Models (LRMs): a tendency toward inefficient, non-structural thinking dubbed “Lazy Reasoning” when tackling complex, multi-step problems involving external tools.

The breakthrough allows smaller models to achieve performance previously restricted to much larger systems, demonstrating superior tool proficiency across diverse benchmarks. Notably, a D-CORE-trained 14-billion parameter model achieved 79.3% accuracy on standard tests, outperforming state-of-the-art 70-billion parameter models while being five times smaller.

The Problem of Lazy Reasoning

LRMs—the engines behind autonomous agents—rely heavily on reasoning capabilities (often expressed via internal <think> blocks) to decide which tools to use and when. Through empirical analysis, the researchers found that in complex scenarios involving multiple tool calls or conversational turns, baseline models often fail to structurally decompose the task.

Instead of clear planning, they resort to a lengthy process of trial-and-error, generating excessive but ultimately meaningless internal dialogue.

“Lazy Reasoning is a compensatory mechanism,” the authors write, suggesting models default to unproductive verbal reflection when they lack the core capacity for structural decomposition.

For instance, when asked to set a 50,000 RMB budget limit in USD using a tool, a baseline LRM might generate over 1,600 tokens of internal thought, repeatedly cycling through possible parameters, confusing tool outputs, and questioning the original request before producing a convoluted answer.
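
To make the setup concrete, the sketch below shows how an agent loop might separate a model's <think> block from the tool call that follows it. The tag name, the JSON tool-call format, and the get_exchange_rate tool are illustrative assumptions, not details taken from the paper.

```python
import json
import re

# Illustrative model output for the budget example above: hidden reasoning in a
# <think> block, followed by a JSON tool call. The format is an assumption.
raw_output = """<think>
The user wants a 50,000 RMB budget limit expressed in USD.
Plan: 1) look up the RMB-to-USD rate, 2) set the budget with the converted amount.
</think>
{"tool": "get_exchange_rate", "arguments": {"from": "RMB", "to": "USD", "amount": 50000}}"""

# Separate the internal reasoning from the actionable tool call.
reasoning = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL).group(1).strip()
tool_call = json.loads(raw_output.split("</think>", 1)[1].strip())

# A rough proxy for "reasoning length": lazy traces inflate this dramatically.
print("approx. reasoning tokens:", len(reasoning.split()))
print("tool to invoke:", tool_call["tool"], tool_call["arguments"])
```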

D-CORE: Structure and Diversity

The D-CORE framework addresses this challenge with a two-stage training process:

  1. Incentivizing Task Decomposition (Self-Distillation): The model is first trained via self-distillation to generate its own structured subtask list before execution. For the RMB-to-USD conversion, D-CORE forces the model to break the task down clearly: 1) Convert RMB to USD using the exchange rate tool. 2) Set the budget limit using the converted USD amount. This disciplined approach drastically reduces the token count and increases efficiency (a data-construction sketch follows this list).
  2. Diversity-Aware Reinforcement Learning (DA-GRPO): The initial structural training tends to homogenize the model’s thinking. To restore effective reflection and exploration, D-CORE introduces DA-GRPO, a novel reinforcement learning algorithm. By incorporating an entropy-based advantage function, DA-GRPO rewards the generation of diverse, high-entropy tokens, ensuring the model maintains its reflective capabilities without sacrificing the new structural decomposition skills (a minimal sketch of such an entropy-aware advantage also follows the list).
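
As referenced in the first item, here is a minimal sketch of how a decomposition-first training example might be assembled for the self-distillation stage. The field names, the <think> wrapper, and the helper function are assumptions for illustration; the paper's exact data format is not reproduced here.

```python
# Hypothetical helper for building decomposition-first training targets.
def build_decomposition_example(user_request, subtasks, tool_calls):
    """Pack a training example whose target lists subtasks before any execution."""
    plan = "\n".join(f"{i + 1}) {step}" for i, step in enumerate(subtasks))
    return {
        "prompt": user_request,
        # The model is distilled to emit its own plan first, then act on it.
        "target": f"<think>\nSubtasks:\n{plan}\n</think>",
        "tool_calls": tool_calls,
    }

example = build_decomposition_example(
    user_request="Set my budget limit to 50,000 RMB, expressed in USD.",
    subtasks=[
        "Convert 50,000 RMB to USD using the exchange rate tool.",
        "Set the budget limit using the converted USD amount.",
    ],
    tool_calls=[
        {"tool": "get_exchange_rate", "arguments": {"from": "RMB", "to": "USD", "amount": 50000}},
        {"tool": "set_budget_limit", "arguments": {"currency": "USD"}},
    ],
)
```

The point is the ordering: the plan is part of the training target, so the model is rewarded for decomposing the task before any tool is invoked.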

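For the second stage, the sketch below illustrates one way an entropy-based term could be folded into a GRPO-style advantage, in the spirit of DA-GRPO. The specific weighting and the per-rollout centering are assumptions; the paper's exact advantage function may differ.

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: normalize rewards across a group of sampled rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def entropy_aware_advantage(rewards, token_entropies, entropy_weight=0.1):
    """Broadcast each rollout's advantage to its tokens, then add an entropy bonus
    so high-entropy (exploratory) tokens are rewarded rather than suppressed."""
    seq_adv = group_relative_advantage(rewards)
    per_token = []
    for adv, ent in zip(seq_adv, token_entropies):
        # Center the entropies within the rollout so the bonus is relative:
        # diverse tokens gain, rote tokens lose, and the mean advantage is preserved.
        per_token.append(adv + entropy_weight * (ent - ent.mean()))
    return per_token

# Toy usage: four sampled rollouts with scalar rewards and per-token entropies.
rewards = np.array([1.0, 0.0, 1.0, 0.5])
token_entropies = [np.random.rand(n) for n in (12, 9, 15, 11)]
advantages = entropy_aware_advantage(rewards, token_entropies)
```

The intent matches the stated goal of the second stage: preserve reflective, exploratory tokens so that the structural training does not homogenize the model's thinking.
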
A Leap in Tool-Use Proficiency

The results show a clear shift from inefficient reasoning to efficient, executable processes. On the BFCLv3 benchmark, D-CORE models achieved substantial accuracy gains, particularly on challenging multi-turn tasks where performance improved by 30.8%.

This enhanced capability translates directly to real-world agent performance. On the T-Bench Airline task—which requires handling complex multi-step processes like assessing refunds, checking flight details, and compensating users—the D-CORE 14B model established a new state-of-the-art, validating that robust task decomposition is the key to unlocking true proficiency in complex tool use.