AI Agents Suffer from a “Collaboration Gap,” Tanking Performance in Simple Teamwork
New research reveals that even high-performing AI models struggle dramatically when required to team up, signaling a critical roadblock for the future of multi-agent systems.
[4 November 2025] The vision of AI agents working together to solve complex tasks is driving massive investment across the technology sector. However, a new study using a specialized maze-solving benchmark has uncovered a fundamental weakness in current large language models (LLMs): a massive “collaboration gap” that causes high-performing individual agents to collapse into ineffectiveness when forced to work as a team.
Researchers at EPFL and Microsoft Research introduced a novel, scalable collaborative maze-solving task designed to isolate and measure an agent’s ability to communicate and coordinate. Unlike standard solo benchmarks, agents were presented with distributed, partially obscured maps—each agent only sees half the maze—forcing them to engage in dialogue to “ground” a shared understanding of the environment and agree on every move.
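The core of the task is partial observability: neither agent's view contains enough of the maze to plan alone. A minimal sketch of that idea, with an illustrative grid and hypothetical function names (not taken from the benchmark itself):

```python
# Toy illustration of the distributed-map setup: each agent receives only
# half of a shared maze ('?' marks hidden cells), so solving it requires
# exchanging information. The maze layout and names are made up.

MAZE = [
    "S.#.",
    ".#..",
    "..#.",
    ".#.G",
]

def partial_view(maze, agent):
    """Return the maze with the half the given agent cannot see masked out."""
    mid = len(maze[0]) // 2
    masked = []
    for row in maze:
        if agent == "left":
            masked.append(row[:mid] + "?" * (len(row) - mid))
        else:
            masked.append("?" * mid + row[mid:])
    return masked

# The left agent sees the start 'S' but not the goal 'G', and vice versa,
# so the pair must talk to ground a shared picture of the maze.
left_view = partial_view(MAZE, "left")
right_view = partial_view(MAZE, "right")
```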
The Paradox of Solo Success
The results, drawn from evaluating 32 leading open- and closed-source models in various pairings, revealed a paradoxical drop in capability. Models that could solve the full maze nearly perfectly alone often saw their performance degrade substantially in collaborative mode.
“Collaboration broke down dramatically,” the authors note. For instance, smaller models specifically engineered for efficiency (distilled models like GPT-5-nano) that performed moderately well solo often failed almost completely in team settings.
This suggests that current AI training paradigms prioritize individual reasoning and output quality, failing to instill the crucial capabilities required for dynamic, on-the-fly collaboration.
A key failure mode was a lack of “grounding,” the process by which participants establish mutual understanding. In one instance, two Grok-3 agents stalled because one communicated coordinates as (row, column) while the other interpreted them as (column, row). This seemingly minor misalignment led to a conflict over whether a specific cell was open or a wall, halting progress entirely.
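The conflict is easy to reconstruct in a few lines. This is an illustrative sketch of the failure mode described above, not code from the study; the grid and helper function are hypothetical:

```python
# Two agents read the same coordinate pair under different conventions:
# the sender means (row, column), the receiver assumes (column, row).

MAZE = [
    "..#",
    ".#.",
    "...",
]

def cell(maze, row, col):
    """Look up a maze cell by (row, column)."""
    return maze[row][col]

msg = (0, 2)  # sender's intent: row 0, column 2, which holds a wall '#'

as_intended = cell(MAZE, msg[0], msg[1])  # (row, col): '#', a wall
as_misread = cell(MAZE, msg[1], msg[0])   # (col, row): row 2, col 0: '.', open

# Both readings are internally consistent, so each agent believes the
# other is wrong about the same cell, and the dialogue deadlocks.
```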
Ordering Effects and Relay Inference
The study also investigated collaboration between agents of different strengths (heterogeneous pairings). A significant “ordering effect” emerged: performance was heavily influenced by which agent initiated the dialogue.
For example, when the powerful O3 model was paired with the weaker GPT-4.1-mini, the pair performed significantly better if O3 spoke first. When the weaker agent led, the results lagged.
Based on this insight, the researchers propose a new strategy called “Relay Inference,” designed to mitigate the collaboration gap. This approach involves leveraging a stronger agent to “seed” or “prime” the initial steps of the dialogue before handing the task off to a cheaper, weaker model.
The results for Relay Inference were compelling: even just one initial message from a strong model was enough to significantly boost the performance of the weaker partner, closing much of the observed gap. This suggests that the strong model quickly establishes a reliable communication protocol and shared mental model, which the weaker agent can then successfully inherit.
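The handoff itself is simple to express. Below is a minimal sketch of the relay idea as the article describes it; `strong_model` and `weak_model` are hypothetical stand-ins for real LLM calls, not the authors' implementation:

```python
# Relay Inference sketch: a strong model speaks for the first few turns
# to establish a protocol, then a cheaper model continues the dialogue.

def relay_dialogue(strong_model, weak_model, task, total_turns, seed_turns=1):
    """Run a dialogue where the strong model takes the first `seed_turns`
    turns and the weak model takes over for the rest."""
    transcript = [task]
    for turn in range(total_turns):
        speaker = strong_model if turn < seed_turns else weak_model
        transcript.append(speaker(transcript))
    return transcript

# Toy stand-ins: the strong model fixes a coordinate convention up front,
# which the weak model then simply inherits and follows.
strong = lambda history: "Protocol: coordinates are (row, column)."
weak = lambda history: "Acknowledged: " + history[1]

log = relay_dialogue(strong, weak, "Solve the maze together.", total_turns=3)
```

The point of the toy example is the mechanism: one early message from the stronger partner sets a convention that removes the grounding ambiguity for every later turn.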
The authors argue that collaboration represents a distinct capability axis that must be explicitly designed for, rather than being hoped for as an emergent property of existing training. The existence of the collaboration gap in a simple, stylized task like maze-solving suggests the difficulty will be even greater in real-world, long-horizon applications.