AI Agent Learns Computer Use From Limited Data, Outperforms Strong Models
A new AI agent, PC Agent-E, developed by researchers at Shanghai Jiao Tong University and SII, shows that strong computer use skills can be learned from very little data. The agent, described in a recent paper, outperforms even powerful models like Claude 3.7 Sonnet despite being trained on only a few hundred human demonstrations.
The key to PC Agent-E’s success lies in a novel training framework that leverages a small dataset of human-annotated computer use trajectories and augments it with AI-generated data. The researchers started with just 312 human demonstrations of various computer tasks.
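Concretely, each demonstration of this kind can be pictured as a task description plus an ordered sequence of steps, each pairing a screen observation with the annotator’s thought and action. The minimal Python sketch below is illustrative; the field names are assumptions, not the paper’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step of a computer use demonstration (illustrative schema)."""
    screenshot: bytes   # raw observation, e.g. a PNG of the screen
    thought: str        # the annotator's reasoning at this step
    action: str         # the action taken, e.g. "click(x=512, y=300)"

@dataclass
class Trajectory:
    """A complete human demonstration of one computer task."""
    task: str                                    # natural-language task description
    steps: list[Step] = field(default_factory=list)
```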
“Collecting high-quality trajectory data has long been a bottleneck in developing human-like computer use agents,” explains Pengfei Liu, a corresponding author on the paper.
To overcome this data scarcity, the team introduced a method called “Trajectory Boost.” This process uses a strong AI model (Claude 3.7 Sonnet) to analyze the human trajectories and synthesize alternative action decisions at each step.
For instance, consider a user trying to open a specific application. A human demonstrator might click on the application icon directly. Trajectory Boost, however, would generate several other potentially valid actions, like searching for the application in the start menu or using a keyboard shortcut.
These synthesized actions effectively diversify the training data, exposing the agent to a broader range of possibilities and improving its ability to generalize.
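In outline, the augmentation loop might look like the following sketch, reusing the `Trajectory` and `Step` classes above. The `propose_alternatives` stub stands in for a call to the strong model; its name, signature, and the choice to swap one action per variant are assumptions for illustration, not the authors’ exact implementation.

```python
def propose_alternatives(task: str, screenshot: bytes, human_action: str,
                         n: int = 3) -> list[str]:
    """Ask a strong model (Claude 3.7 Sonnet in the paper) for up to n other
    valid actions at this step. Stubbed here; wire it to a real model API."""
    return []  # placeholder: no alternatives without a model behind it

def trajectory_boost(traj: Trajectory, n_alternatives: int = 3) -> list[Trajectory]:
    """Diversify one human demonstration into several training variants.

    For each step, keep the surrounding human context but swap in a
    model-proposed alternative action. A fuller version would also
    regenerate the reasoning behind each alternative.
    """
    boosted = [traj]  # always keep the original human trajectory
    for i, step in enumerate(traj.steps):
        for alt in propose_alternatives(traj.task, step.screenshot,
                                        step.action, n_alternatives):
            variant = list(traj.steps)
            variant[i] = Step(step.screenshot, step.thought, alt)
            boosted.append(Trajectory(traj.task, variant))
    return boosted
```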
“The insight is that many computer tasks can be completed through multiple valid pathways,” says Yanheng He, co-lead author on the project.
The PC Agent-E model, trained on this augmented data, achieved a 141% relative improvement over its base model. It also surpassed the performance of Claude 3.7 Sonnet with extended thinking on an improved benchmark called WindowsAgentArena-V2, which tests an agent’s ability to perform tasks in a Windows environment.
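For readers keeping score, relative improvement is the gain expressed as a fraction of the baseline score. The numbers below are hypothetical, for illustration only, not the paper’s reported figures.

```python
def relative_improvement(new_score: float, baseline: float) -> float:
    """Relative improvement as a percentage of the baseline score."""
    return 100 * (new_score - baseline) / baseline

# Hypothetical task-success rates: baseline at 15.0%, trained agent at 36.2%
print(relative_improvement(36.2, 15.0))  # ≈ 141.3
```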
Furthermore, the agent showcased strong generalizability, successfully adapting to different operating systems, including Linux-based environments evaluated using the OSWorld benchmark. This is significant because it suggests that the learned skills are transferable across platforms, rather than being specific to a single operating system.
One key aspect of the work is the meticulous data decontamination process. The researchers compared training tasks against the held-out evaluation benchmark using n-gram overlap and semantic similarity, removing any leakage that could inflate results. They also identified and addressed flaws in the original WindowsAgentArena benchmark, such as dependencies in task evaluation and “infeasible hacking,” where an agent can inflate its score on impossible tasks simply by declaring tasks infeasible.
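A minimal version of the n-gram half of such a check might look like this; the gram length and word-level tokenization are assumed choices, and the paper’s pipeline additionally uses semantic similarity.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word-level n-grams of a lowercased task description."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_task: str, eval_tasks: list[str],
                    n: int = 8) -> bool:
    """Flag a training task that shares any long n-gram with the
    held-out benchmark (n and the threshold are illustrative)."""
    train_grams = ngrams(train_task, n)
    return any(train_grams & ngrams(t, n) for t in eval_tasks)

# Training tasks flagged here would be dropped before training.
```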
The researchers are releasing their code, data, and models to facilitate future research in this area. This open-source approach is expected to accelerate progress in developing more capable and efficient computer use agents.
The work has implications for automating a wide range of tasks, reducing human workload, and unlocking new capabilities in artificial intelligence. While current models often fall short of human performance due to limited computer knowledge and planning abilities, PC Agent-E demonstrates that significant improvements can be achieved with a focus on high-quality, diverse training data.