AI Papers Reader

Personalized digests of latest AI research

View on GitHub

LLMs Struggle to Collaborate Over Time: A New Benchmark Highlights Limitations

Large language models (LLMs) are increasingly integrated into workplaces, demonstrating impressive abilities to solve individual problems. However, a new research paper, “From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions,” reveals a significant limitation: current LLMs struggle to effectively collaborate in long-term interactions. The research team, from Universitat Pompeu Fabra, University of Amsterdam, and Cohere, introduces MEMORYCODE, a novel synthetic dataset designed to evaluate this crucial aspect of LLM capabilities.

MEMORYCODE simulates realistic multi-session interactions between a mentor and a mentee in a professional coding environment. The mentor provides instructions interspersed with irrelevant office chatter across multiple sessions. The mentee—an LLM—must accurately execute coding tasks based on information learned across this extended interaction. A simplified example depicts a mentee learning a coding convention on Day 1 (always add the date to function names). Irrelevant office discussions occur on Day 5. On Day 20, the mentee is tasked with writing a function; success depends on recalling and applying the Day 1 convention.

The researchers tested several state-of-the-art LLMs, including GPT-40, Llama-3.1 models, and Cohere’s Command R+. When presented with isolated instructions (e.g., “write a function…”), all models performed exceptionally well, achieving near-perfect accuracy. This showcases LLMs’ impressive ability to follow simple coding instructions in isolation.

However, performance dramatically decreased when instructions were spread across multiple sessions. Even powerful models like GPT-40 experienced a significant drop (67%) in accuracy when presented with the entire dialogue history. Smaller models fared even worse. This highlights a critical gap: LLMs struggle to effectively retrieve and integrate information across extended conversational sequences. This difficulty is not a mere retrieval problem; even when provided only with the relevant instructions, the models failed, indicating a reasoning deficit. The researchers posit that LLMs lack the necessary prospective memory and information integration capabilities to successfully act as long-term collaborators.

The MEMORYCODE dataset, released under the Apache 2.0 license, offers a valuable benchmark for future research. It directly addresses the need to develop sophisticated memory mechanisms and advanced reasoning abilities within LLMs. The current results demonstrate a fundamental limitation of existing LLMs that severely restricts their ability to effectively collaborate in realistic, long-term professional settings. The study points to a need for innovation beyond simply increasing model size, suggesting that novel architectural and training methodologies are essential to creating truly collaborative LLMs. While synthetic data has inherent limitations (lacking the nuances of genuine human interactions), MEMORYCODE provides a rigorously controlled setting to isolate and precisely evaluate long-term memory and reasoning abilities, a crucial step toward improving LLM capabilities.