Fine-Grained Reinforcement Learning Boosts Reasoning in AI Agents

Large Language Model (LLM) agents are showing great promise in tackling complex tasks, but training them to reason effectively over multiple steps remains a significant challenge. A new paper from researchers at the University of Minnesota, Prime Intellect, and Morgan Stanley proposes a method to improve multi-turn reasoning in LLM agents by focusing on precise credit assignment at each step of the reasoning process.

The core idea is to treat the agent’s interaction with its environment as a Markov Decision Process (MDP). This means breaking down a complex task into a sequence of smaller decisions or “turns.” The challenge is figuring out which actions in each turn contribute to the final success or failure, so the agent can learn to make better choices.
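
To make the framing concrete, here is a minimal sketch in Python (with hypothetical names; the paper does not prescribe this exact data layout) of a multi-turn episode viewed as an MDP, where each turn records the state the agent saw, the action it took, and any reward attached to that turn.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    """One step of the agent's MDP: what it saw, what it did, what it earned."""
    state: str           # e.g., the conversation and tool outputs so far
    action: str          # e.g., a search query or a final answer
    reward: float = 0.0  # turn-level reward (often 0 until the final turn)

@dataclass
class Trajectory:
    """A full multi-turn episode."""
    turns: List[Turn] = field(default_factory=list)

    def total_reward(self) -> float:
        return sum(t.reward for t in self.turns)
```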

The researchers found that existing Reinforcement Learning (RL) approaches often struggle with this problem. Many methods treat the entire sequence of actions as a single “bandit” problem, where the reward is only given at the end. This makes it difficult to determine which individual actions were helpful or harmful.
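
A toy illustration of why outcome-only credit is ambiguous (a hedged sketch, not the paper's formulation): when the reward arrives only at the end, every turn inherits the identical signal.

```python
def outcome_only_advantages(num_turns: int, outcome_reward: float) -> list[float]:
    """Bandit-style credit: every turn gets the same signal, so a good search
    followed by a bad summary is indistinguishable from the reverse."""
    return [outcome_reward] * num_turns

print(outcome_only_advantages(num_turns=2, outcome_reward=0.0))  # [0.0, 0.0]
```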

“Imagine you’re trying to train an agent to answer a question by searching Wikipedia,” explains Siliang Zeng, the lead author. “If the agent gets the final answer wrong, it’s hard to know if it’s because the initial search query was bad, or if the agent failed to summarize the information correctly.”

To address this, the researchers introduce a “turn-level advantage estimation” strategy. This means assigning credit (or blame) not just for the final outcome, but also for the intermediate steps. For example, was the search query well-formed? Did the search results contain the correct answer?
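
One plausible way to realize this, sketched under my own assumptions rather than the paper's exact reward design: attach verifiable turn-level signals, such as "the search query executed correctly" or "the retrieved passage contains the answer", alongside the final outcome.

```python
def turn_rewards(query_ok: bool, result_has_answer: bool,
                 answer_correct: bool) -> list[float]:
    """Hypothetical per-turn rewards for a search-then-answer episode."""
    search_turn = 0.5 * query_ok + 0.5 * result_has_answer  # intermediate credit
    answer_turn = 1.0 * answer_correct                      # outcome credit
    return [search_turn, answer_turn]

# A well-formed query whose results contained the answer, followed by a wrong
# final answer, still earns credit on the first turn: [1.0, 0.0]
print(turn_rewards(query_ok=True, result_has_answer=True, answer_correct=False))
```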

They integrate their strategy with Group Relative Policy Optimization (GRPO), a popular RL algorithm, creating a new method they call Multi-Turn GRPO (MT-GRPO). The core idea of GRPO is to compare a group of responses generated for the same prompt, scoring each one relative to the others to determine which are better. MT-GRPO extends this comparison to the individual turns within each response.
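
The "group relative" part of GRPO normalizes each sampled response's reward against the statistics of its group; in the turn-level extension, the same kind of comparison can be applied to the rewards collected at a given turn across the group. A minimal sketch of that normalization (hypothetical helper, not the authors' released code):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: how much better each sample is than its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Outcome-level rewards for a group of sampled responses...
outcome_adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# ...and the same normalization applied to one turn's rewards across the group.
search_turn_adv = group_relative_advantages([0.5, 0.0, 1.0, 0.5])
```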

The results are promising. In experiments where an LLM agent uses a Wikipedia search tool to answer questions from the TriviaQA dataset, MT-GRPO achieves a 100% tool-execution success rate and 50% exact-match answer accuracy. By contrast, baseline methods that do not assign turn-level credit struggle to use the tool at all and achieve only 20-30% answer accuracy.

The researchers also found that MT-GRPO promotes more stable training. Agents trained with the baseline methods sometimes “forget” to use tools during training, but MT-GRPO leads to more consistent tool usage.

While the work focuses on a specific tool-use scenario, the authors emphasize that the turn-level advantage estimation strategy is general and can be applied to other RL algorithms and tasks. This could lead to significant improvements in the reasoning capabilities of AI agents in a wide range of applications, from customer service chatbots to scientific discovery assistants.