

Teaching AI to Use Tools Without the "Training Wheels" of Costly Labeled Data

Large language models (LLMs) are often compared to brilliant scholars locked in a room with a fixed set of books. They are incredibly knowledgeable about everything written up to their last day of training, but they are “frozen” in time. Ask a standard model about a news event from last week, or hand it a math problem that calls for running a Python script, and it often falters.

To break out of this cage, AI researchers have been trying to teach models to use external tools like search engines and calculators. However, the traditional way of teaching these skills—a process called Supervised Fine-Tuning (SFT)—is notoriously slow and expensive. It requires humans to manually write out thousands of “gold standard” examples showing exactly how a model should search, think, and respond.

A team of researchers from the National University of Singapore, Salesforce AI Research, and UC Berkeley may have found a better way. In a new paper, they introduce In-Context Reinforcement Learning (ICRL), a method that allows AI to teach itself how to use tools through trial and error, effectively skipping the expensive human-labeling phase.

The Problem with “Cold Starts”

Teaching an AI to use a tool is a “cold-start” problem. If you just give a raw model access to a search engine and tell it to solve a problem, it doesn’t know the “syntax”—it might not know it needs to wrap its query in specific tags like <search>. Because it fails immediately, it never receives a “reward” for getting the answer right, and therefore never learns.

Standard practice solves this by using SFT to give the model a push. But ICRL takes a different approach: it uses “few-shot” prompting as a temporary scaffold.

How ICRL Works: The Vanishing Scaffold

Imagine teaching a child to solve a puzzle. Instead of moving their hands for them (SFT), you show them three completed puzzles first (few-shot prompting). Then, you give them a new puzzle and let them try it. As they get better, you show them only two examples, then one, and finally, you let them work entirely on their own.

ICRL follows this exact curriculum:

  1. Imitation: During the initial phase of reinforcement learning, the model is shown a few examples of “tool-augmented” reasoning within its prompt.
  2. Exploration: Guided by these examples, the model begins to explore how to call tools. It is rewarded when it reaches the correct answer and follows the right format.
  3. Independence: As training progresses, the researchers gradually reduce the number of examples until the model is working in a “zero-shot” setting—using tools autonomously without any hints.
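The vanishing scaffold can be sketched as a schedule that maps the current training step to the number of worked examples prepended to the prompt. The step-wise linear decay below is purely illustrative; the paper's actual curriculum may use a different schedule.

```python
# Minimal sketch of the "vanishing scaffold": the number of in-context
# demonstrations decays to zero as RL training progresses. The linear
# decay here is an assumption for illustration, not the paper's schedule.

def num_fewshot_examples(step: int, total_steps: int, max_shots: int = 3) -> int:
    """How many worked examples to prepend at this training step."""
    # Fraction of training remaining, in [0, 1].
    remaining = max(0.0, 1.0 - step / total_steps)
    # Scale down from max_shots to 0 as training proceeds.
    return round(max_shots * remaining)

def build_prompt(question: str, demos: list[str], step: int, total_steps: int) -> str:
    """Prepend the scheduled number of demonstrations to the question."""
    k = num_fewshot_examples(step, total_steps, max_shots=len(demos))
    scaffold = "\n\n".join(demos[:k])
    return (scaffold + "\n\n" if scaffold else "") + question
```

Early in training the model imitates three demonstrations; by the final steps `build_prompt` returns the bare question and the model must act zero-shot.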

To ensure the model learns correctly, the researchers used a technique called “loss masking.” This ensures the model only learns from its own reasoning and tool-calling actions, rather than accidentally trying to memorize the information it retrieved from the internet.
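The idea behind loss masking can be shown in a few lines, assuming each token in a rollout is tagged with its origin (this tagging scheme is a simplification for illustration): only tokens the policy itself generated contribute to the loss, while tokens injected from search results are zeroed out.

```python
# Minimal sketch of loss masking. Tokens the model generated (reasoning,
# tool calls) keep a weight of 1; tokens injected from the search tool get
# a weight of 0, so the model is never trained to reproduce retrieved text.

MODEL = "model"  # token sampled by the policy
TOOL = "tool"    # token pasted in from a search result

def loss_mask(token_origins: list[str]) -> list[int]:
    """1 = token contributes to the loss, 0 = masked out."""
    return [1 if origin == MODEL else 0 for origin in token_origins]

# Example rollout: two reasoning tokens, three retrieved tokens, one answer token.
origins = [MODEL, MODEL, TOOL, TOOL, TOOL, MODEL]
mask = loss_mask(origins)  # [1, 1, 0, 0, 0, 1]
```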

Concrete Results: The Washington Test

To build an intuition for how this helps, consider a “multi-hop” question: “When did the president who set the two-term precedent enter office?”

A model without tool access might guess or hallucinate. A model trained via ICRL, however, learns a sequence:

  • Step 1 (Think): It realizes it needs to identify the president first.
  • Step 2 (Search): It invokes <search> president two term limit precedent </search>.
  • Step 3 (Process): It receives information about George Washington.
  • Step 4 (Search): It issues a second search: <search> George Washington inauguration date </search>.
  • Step 5 (Answer): It concludes with the correct date: April 30, 1789.
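The five-step sequence above can be sketched as a loop that alternates between generation and retrieval, stopping when the model's output no longer contains a search tag. Here `generate` and `search_engine` are hypothetical stand-ins for the policy model and the retrieval backend; the real system's interfaces will differ.

```python
import re

# Minimal sketch of the multi-hop tool loop, assuming the model emits tool
# calls wrapped in <search>...</search> tags and results are fed back in a
# <result>...</result> block. Interfaces here are illustrative stand-ins.

SEARCH_TAG = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def run_agent(question, generate, search_engine, max_hops=4):
    context = question
    for _ in range(max_hops):
        output = generate(context)
        match = SEARCH_TAG.search(output)
        if match is None:
            return output  # no tool call: treat the output as the final answer
        query = match.group(1).strip()
        result = search_engine(query)
        # Append the tool call and its result, then let the model continue.
        context += output[:match.end()] + f"\n<result> {result} </result>\n"
    return generate(context)  # hop budget exhausted: force a final answer
```

With stub functions for `generate` and `search_engine`, the loop reproduces the Washington example: one search identifies the president, a second retrieves the inauguration date, and the final generation answers without issuing another tool call.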

The researchers tested ICRL on the Qwen2.5 model family across several difficult benchmarks. The results were striking: ICRL outperformed traditional methods by as much as 8.9% on general knowledge tasks and even matched or exceeded the performance of models that had the benefit of thousands of human-labeled examples.

By eliminating the need for expensive manual data, ICRL provides a scalable, more efficient path toward AI agents that can navigate the real world, search the live web, and solve complex problems as easily as they chat.