New Benchmark AGENTSNET Tests Large Language Model Coordination and Collaboration

A new benchmark called AGENTSNET has been introduced to evaluate how well large language models (LLMs) can coordinate and collaborate in multi-agent systems. Developed by researchers from ETH Zurich and RWTH Aachen University, AGENTSNET draws inspiration from fundamental problems in distributed computing and graph theory to assess LLMs’ abilities in self-organization, communication, and collaborative problem-solving.

Unlike existing benchmarks that typically involve a small number of agents (2-5), AGENTSNET is designed to be scalable, supporting networks of up to 100 agents. This allows for more realistic evaluations of how LLMs perform in complex, distributed environments.

The benchmark comprises five core tasks derived from distributed computing principles (each has a simple success criterion, illustrated in the sketch after this list):

  • Coloring: Agents must assign themselves to groups (colors) such that no two connected agents share the same group. This task tests the ability to distribute responsibilities without overlap. For example, agents might be assigned different roles like “web search” or “coding” with the constraint that connected agents cannot have the same role.

  • Vertex Cover: Agents need to select a minimal set of “coordinator” agents such that every connection (edge) in the network touches at least one coordinator. This evaluates the ability to identify influential nodes for system monitoring or bridging communication gaps.

  • Matching: Agents must form pairs with their neighbors such that no agent belongs to more than one pair. This task assesses their capability to negotiate agreements for tasks or resource allocation. For instance, agents might need to pair up to complete specific sub-tasks.

  • Leader Election: The system must collaboratively select a single agent as a leader, with all other agents acknowledging this choice. This is crucial for establishing hierarchy and centralized decision-making within a multi-agent system.

  • Consensus: All agents must agree on a single value (e.g., 0 or 1) through communication. This tests the system’s ability to converge to a global agreement.
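
Each of these tasks has a crisp, graph-theoretic success criterion, which is what makes them automatically checkable. The sketch below illustrates such checks on a toy network; the helper functions and the example network are illustrative only, not the benchmark’s actual evaluation code.

```python
# Illustrative verifiers for the five AGENTSNET tasks (hypothetical helpers,
# not the official benchmark implementation).

# A tiny 5-agent network as an adjacency list: agent -> set of neighbors.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}

def edges(g):
    return {frozenset((u, v)) for u, nbrs in g.items() for v in nbrs}

def valid_coloring(g, color):
    # No two connected agents may share a color (role).
    return all(color[u] != color[v] for u, v in map(tuple, edges(g)))

def valid_vertex_cover(g, cover):
    # Every edge must touch at least one "coordinator" agent.
    return all(u in cover or v in cover for u, v in map(tuple, edges(g)))

def valid_matching(g, pairs):
    # Pairs must be existing edges, and no agent may appear in two pairs.
    flat = [a for pair in pairs for a in pair]
    return (all(frozenset(p) in edges(g) for p in pairs)
            and len(flat) == len(set(flat)))

def valid_leader_election(votes):
    # Every agent must acknowledge the same single leader.
    return len(set(votes.values())) == 1

def valid_consensus(values):
    # All agents must converge on one value (e.g., 0 or 1).
    return len(set(values.values())) == 1

if __name__ == "__main__":
    print(valid_coloring(graph, {0: "search", 1: "code", 2: "plan", 3: "search", 4: "code"}))
    print(valid_vertex_cover(graph, {0, 2, 3}))
    print(valid_matching(graph, [(0, 1), (2, 3)]))
    print(valid_leader_election({agent: 2 for agent in graph}))
    print(valid_consensus({agent: 1 for agent in graph}))
```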

The researchers evaluated several frontier LLMs, including models from OpenAI (GPT-4.1 mini), Google (Gemini series), and Anthropic (Claude series), as well as open-source models like Llama 4. The results indicate that while some advanced LLMs show strong performance on smaller networks, their effectiveness often diminishes as the network size increases. For example, Gemini 2.5 Pro and Claude 3.7 Sonnet demonstrated high scores across various tasks, but even these models struggled with larger networks, with performance dropping significantly for 100-agent configurations.

Qualitative analysis of the LLMs’ communication revealed key challenges, including difficulties in coordinating on a shared strategy, agents accepting erroneous information from neighbors, and the complexity of synchronous message passing. Despite these challenges, some agents demonstrated proactive behavior, such as one agent noticing and helping to resolve conflicting group assignments in the Coloring task.
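
The synchronous message-passing model the agents operate under can be pictured as a simple round loop: every agent reads the messages delivered in the previous round and composes its outgoing messages before any of those are delivered. The skeleton below is a minimal sketch of that structure, assuming agents only talk to their immediate neighbors; `query_llm` is a hypothetical placeholder, not the paper’s implementation.

```python
# Minimal sketch of synchronous message-passing rounds between neighboring
# agents. `query_llm` is a hypothetical stand-in for a real model call.
def query_llm(agent, inbox):
    # Placeholder: a real implementation would prompt an LLM with the agent's
    # task, its neighbors, and the messages received this round.
    return {neighbor: f"agent {agent} acknowledges round input" for neighbor in inbox}

def run_rounds(graph, num_rounds):
    # graph: agent -> set of neighboring agents (symmetric adjacency)
    inboxes = {agent: {nbr: "" for nbr in nbrs} for agent, nbrs in graph.items()}
    for _ in range(num_rounds):
        # Every agent composes its outgoing messages from the same snapshot of
        # inboxes, which is what makes the rounds synchronous.
        outgoing = {agent: query_llm(agent, inbox) for agent, inbox in inboxes.items()}
        # Deliver: what neighbor `nbr` sent to `agent` becomes `agent`'s inbox entry.
        inboxes = {agent: {nbr: outgoing[nbr][agent] for nbr in nbrs}
                   for agent, nbrs in graph.items()}
    return inboxes

if __name__ == "__main__":
    net = {0: {1}, 1: {0, 2}, 2: {1}}
    print(run_rounds(net, num_rounds=3)[1])
```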

AGENTSNET’s flexible design, which generates problem instances from a range of graph topologies (small-world, scale-free, and Delaunay graphs), allows it to adapt and scale with future advancements in LLM capabilities, making it a valuable tool for pushing the boundaries of collaborative AI.
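
These topology families can be generated with standard tools; the sketch below uses networkx and scipy with illustrative parameters (the generator choices and network size are assumptions, not the paper’s exact setup).

```python
# Sketch: generating the three topology families with networkx/scipy.
# Parameters are illustrative, not taken from the paper.
import networkx as nx
import numpy as np
from scipy.spatial import Delaunay

n = 16  # number of agents in the network

# Small-world: Watts-Strogatz ring lattice with random rewiring.
small_world = nx.connected_watts_strogatz_graph(n, k=4, p=0.1)

# Scale-free: Barabasi-Albert preferential attachment.
scale_free = nx.barabasi_albert_graph(n, m=2)

# Delaunay: triangulate random points in the plane and keep the triangle edges.
points = np.random.rand(n, 2)
tri = Delaunay(points)
delaunay = nx.Graph()
delaunay.add_nodes_from(range(n))
for simplex in tri.simplices:            # each simplex is a triangle (i, j, k)
    for u, v in [(0, 1), (1, 2), (0, 2)]:
        delaunay.add_edge(int(simplex[u]), int(simplex[v]))

for name, g in [("small-world", small_world),
                ("scale-free", scale_free),
                ("delaunay", delaunay)]:
    print(name, g.number_of_nodes(), "agents,", g.number_of_edges(), "edges")
```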