AI Agents Put to the Test: New Benchmark Reveals Strengths and Limitations of LLMs in Real-World Work
A team of researchers from Carnegie Mellon University and independent institutions has unveiled a groundbreaking new benchmark, TheAgentCompany, designed to rigorously evaluate the performance of large language model (LLM) agents in realistic professional settings. The study, published as a preprint on arXiv, reveals a nuanced picture of AI’s capabilities, showcasing both impressive progress and significant limitations in automating real-world tasks.
TheAgentCompany simulates a small software engineering company, complete with internal websites (mimicking GitLab, OwnCloud, Plane, and Rocket.Chat), simulated colleagues, and a diverse set of 175 tasks representative of the day-to-day work at a small software firm. These tasks span several domains, including software engineering, project management, financial analysis, and human resources. For example, tasks might involve analyzing spreadsheets, writing code, communicating with simulated colleagues through a chat interface, managing project sprints, or even filling out tax forms.
The researchers tested seven prominent LLMs, including both closed API-based models (like Anthropic’s Claude, OpenAI’s GPT-4, Google’s Gemini, and Amazon’s Nova) and open-weight models (like Meta’s Llama and Alibaba’s Qwen), using the OpenHands agent framework, which lets the agents interact with the simulated environment through web browsing, code execution, and chat with simulated colleagues.
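To make that interaction loop concrete, here is a minimal sketch of the general pattern such an agent framework follows: an LLM policy repeatedly picks one of a few tools (browser, shell, chat), and each observation is fed back into the agent’s state until the task is finished. This is not OpenHands code; every name in it (Action, AgentState, llm_decide, the stub tools and URLs) is hypothetical and only meant to illustrate the idea.

```python
# Illustrative sketch only: NOT the OpenHands API. It shows, under assumed
# names, how an agent framework of this kind wires an LLM to a small set of
# tools (browser, shell, chat) and loops until the task is done.

from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str        # "browse", "run", "message", or "finish"
    argument: str

@dataclass
class AgentState:
    task: str
    history: list = field(default_factory=list)  # (action, observation) pairs

# --- Stub tools standing in for the simulated company environment ----------

def browse(url: str) -> str:
    return f"[page content of {url}]"            # e.g. the simulated GitLab page

def run(command: str) -> str:
    return f"[output of `{command}`]"            # sandboxed code execution

def message(colleague_and_text: str) -> str:
    return f"[reply from {colleague_and_text.split(':')[0]}]"  # simulated chat colleague

TOOLS = {"browse": browse, "run": run, "message": message}

# --- Placeholder "LLM policy"; a real framework would query the model here --

def llm_decide(state: AgentState) -> Action:
    # Hypothetical decision logic, for illustration only.
    if not state.history:
        return Action("browse", "http://gitlab.local/project/issues")
    if len(state.history) == 1:
        return Action("run", "pytest tests/")
    return Action("finish", "Task complete")

def run_agent(task: str, max_steps: int = 10) -> AgentState:
    state = AgentState(task=task)
    for _ in range(max_steps):
        action = llm_decide(state)
        if action.tool == "finish":
            break
        observation = TOOLS[action.tool](action.argument)
        state.history.append((action, observation))
    return state

if __name__ == "__main__":
    final = run_agent("Fix the failing test flagged on the sprint board")
    for act, obs in final.history:
        print(act.tool, "->", obs)
```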
The results paint a complex picture. The best-performing agent, powered by Claude 3.5 Sonnet, autonomously completed 24% of the tasks. This means that the AI could successfully navigate the simulated environment, complete all the necessary steps, and achieve the desired outcome without human intervention for roughly one quarter of the assignments. However, this success rate, while notable, indicates there’s still a significant gap to bridge before fully automating professional work.
The researchers also highlight the nuance of “partial completion.” Even when agents failed to complete a task fully, they often made substantial progress. To account for this, they developed a scoring system that awards partial credit for successful completion of individual steps within a larger task. Considering partial credit, Claude 3.5 Sonnet achieved a score of 34.4%. This emphasizes that while fully automated task completion remains elusive, LLMs are already exhibiting considerable capability in handling parts of complex workflows.
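The exact weighting used by TheAgentCompany is defined in the paper; purely as an illustration of how a checkpoint-based partial-credit score could work, the sketch below assumes each checkpoint counts equally toward a progress term and full completion carries an additional assumed weight (completion_weight). Both the function name and the 0.5 weighting are assumptions, not the authors’ formula.

```python
# Illustrative sketch only: assumes a simple checkpoint scheme in which
# partial progress earns proportional credit and full completion earns
# the remaining (assumed) weight. Not the paper's exact scoring rule.

def partial_credit_score(checkpoints_passed: int,
                         total_checkpoints: int,
                         fully_completed: bool,
                         completion_weight: float = 0.5) -> float:
    """Return a score in [0, 1] mixing step-level progress and full completion.

    completion_weight is an assumed parameter, not taken from the paper.
    """
    if total_checkpoints == 0:
        return 1.0 if fully_completed else 0.0
    progress = checkpoints_passed / total_checkpoints
    bonus = 1.0 if fully_completed else 0.0
    return (1 - completion_weight) * progress + completion_weight * bonus

# An agent that cleared 3 of 5 intermediate checkpoints but did not finish
# still earns partial credit rather than a flat zero.
print(partial_credit_score(3, 5, fully_completed=False))  # 0.3
print(partial_credit_score(5, 5, fully_completed=True))   # 1.0
```

The point of such a scheme is simply that an agent which gets most of the way through a workflow is scored higher than one that fails immediately, which is how a 24% full-completion rate can coexist with a 34.4% partial-credit score.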
The study also identified common failure modes. One significant hurdle was the complexity of modern web interfaces: agents struggled with tasks that required navigating pop-ups and intricate UI elements. Similarly, communication with simulated colleagues often proved challenging, highlighting the difficulty LLMs face in understanding and responding appropriately to nuanced social cues.
TheAgentCompany stands apart from previous benchmarks by focusing on the diversity and realism of tasks. It goes beyond simpler, often contrived tasks found in other benchmarks, offering a more realistic testbed for evaluating LLM agents. The benchmark’s open-source nature allows other researchers to build upon its foundation, contributing to the development and advancement of AI agent technology.
In conclusion, while TheAgentCompany shows encouraging progress in LLM agent capabilities, it also clearly demonstrates that there’s still significant room for improvement. The researchers believe their work provides valuable insight into both the potential and limitations of current AI technology, offering a crucial perspective for businesses and policymakers alike as they navigate the complexities of AI adoption in the workplace.