New Framework Trains LLM Agents to Tackle Complex, Long-Horizon Tasks

Shanghai, China - Researchers have developed a novel framework, AgentGym-RL, and a training methodology, ScalingInter-RL, designed to enable Large Language Model (LLM) agents to tackle complex, real-world tasks that require sequential decision-making over extended periods. The work, detailed in a recent paper, addresses a key limitation of current LLM agents: their difficulty with tasks that involve many steps and sustained interaction with an environment.

Unlike previous approaches that often rely on extensive pre-training or supervised fine-tuning, AgentGym-RL trains LLM agents from scratch using reinforcement learning (RL). This allows the agents to learn and adapt through direct interaction, mirroring human cognitive development. The framework is designed to be modular and extensible, supporting a wide array of real-world scenarios and mainstream RL algorithms.
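
To make the training setup concrete, the sketch below shows the kind of multi-turn rollout-and-update loop such a framework runs: the agent acts, the environment returns an observation and reward, and batches of completed episodes feed a policy update. This is a minimal illustration under assumed names; `ToyEnv`, `ToyAgent`, and `update_policy` are placeholders, not AgentGym-RL's actual API.

```python
import random

# Illustrative sketch of a multi-turn agent-environment RL loop of the kind
# described for AgentGym-RL. Every identifier here (ToyEnv, ToyAgent,
# update_policy, ...) is a placeholder assumption, not the framework's real API.

class ToyEnv:
    """Stand-in for a multi-turn environment such as a web page or a game."""

    def reset(self) -> str:
        self.turns_left = 5
        return "initial observation"

    def step(self, action: str):
        self.turns_left -= 1
        done = self.turns_left == 0 or action == "finish"
        reward = 1.0 if action == "finish" else 0.0  # sparse terminal reward
        return "next observation", reward, done

class ToyAgent:
    """Stand-in for an LLM policy; here it just samples a random action."""

    def act(self, observation: str) -> str:
        return random.choice(["search", "click", "finish"])

    def update_policy(self, episodes) -> None:
        pass  # a real agent would apply a policy-gradient (e.g. PPO) update here

def collect_episode(agent, env, max_turns: int):
    """Roll out one episode, recording (observation, action, reward) triples."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_turns):
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory

agent, env = ToyAgent(), ToyEnv()
for iteration in range(100):
    episodes = [collect_episode(agent, env, max_turns=5) for _ in range(16)]
    agent.update_policy(episodes)
```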

A significant innovation introduced by the research is ScalingInter-RL. This training approach tackles the “exploration-exploitation dilemma” in RL. Initially, agents are trained with limited interaction turns to focus on exploiting learned skills and mastering basic tasks. As training progresses, the interaction horizon gradually expands, encouraging deeper exploration, refinement of strategies, and the ability to handle more complex challenges. This staged approach is designed to prevent the agent from collapsing under long-horizon tasks and to encourage the development of more diverse problem-solving behaviors.
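
A minimal way to express that staged-horizon idea is a schedule mapping training progress to an interaction budget. The stage boundaries and turn counts below are illustrative assumptions, not values from the paper.

```python
# Illustrative schedule for the staged-horizon idea behind ScalingInter-RL:
# short interaction budgets early in training, longer ones later. The stage
# boundaries and turn counts are made-up examples, not the paper's values.

def interaction_horizon(iteration: int,
                        stages=((0, 5), (200, 10), (400, 20))) -> int:
    """Return the maximum interaction turns allowed at a training iteration.

    Each (start, turns) pair opens a new stage: early stages keep episodes
    short so the agent exploits basic skills, while later stages lengthen
    the horizon to encourage deeper exploration and multi-step strategies.
    """
    horizon = stages[0][1]
    for start, turns in stages:
        if iteration >= start:
            horizon = turns
    return horizon

# During training, each rollout would be capped at the scheduled budget, e.g.
# collect_episode(agent, env, max_turns=interaction_horizon(iteration)).
```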

The effectiveness of AgentGym-RL and ScalingInter-RL has been demonstrated through extensive experiments. The researchers report that their trained agents, even those based on smaller open-source LLMs (such as a 7B-parameter model), can match or even surpass much larger proprietary models on 27 diverse tasks. For instance, on the challenging WebArena benchmark, which simulates web navigation, their agent achieved 26% overall accuracy, outperforming models like GPT-4o. On the deep search benchmark, their agent achieved an overall score of 38.25%, rivaling top proprietary models.

The framework supports a variety of environments, including web navigation (WebArena), deep search, digital games (TextCraft), embodied tasks (BabyAI), and scientific reasoning (SciWorld). The paper highlights how AgentGym-RL significantly boosts the performance of LLMs in these domains, enabling them to overcome unproductive loops and exhibit more systematic task execution.
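
Given that described modularity, one plausible shape for the plug-in mechanism is a simple environment registry. The decorator and names below are hypothetical illustrations, not identifiers from the AgentGym-RL codebase.

```python
from typing import Callable, Dict

# Hypothetical plug-in registry illustrating the modular, multi-environment
# design described above; none of these identifiers come from AgentGym-RL.
ENV_REGISTRY: Dict[str, Callable[[], object]] = {}

def register_env(name: str):
    """Decorator registering an environment constructor under a task name."""
    def decorator(ctor: Callable[[], object]):
        ENV_REGISTRY[name] = ctor
        return ctor
    return decorator

@register_env("webarena")
def make_webarena_env() -> object:
    # Placeholder; a real constructor would connect to a WebArena instance.
    return object()

def make_env(name: str) -> object:
    """Instantiate a registered environment by task name."""
    return ENV_REGISTRY[name]()
```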

The research also underscores the value of post-training and test-time computation relative to simply increasing model size. Their 7B-parameter agent, trained with AgentGym-RL, achieved a success rate of approximately 58.6%, surpassing models with significantly more parameters.

The authors emphasize that AgentGym-RL is released as an open-source framework to foster collaborative research. They believe this work represents a significant step towards developing more capable and adaptable LLM agents that can autonomously navigate and solve complex real-world problems. Future research directions include enhancing agents’ generalization capabilities to novel environments and exploring multi-agent reinforcement learning.