Foundation Models Learn to Browse the Web Autonomously
A new AI system allows large language models to learn new skills without human intervention, significantly improving their ability to handle real-world tasks.
BERKELEY, CA – A team of researchers from UC Berkeley, the University of Illinois Urbana-Champaign, and Amazon have developed a novel AI system called Proposer-Agent-Evaluator (PAE) that empowers foundation models – the large language models (LLMs) and vision-language models (VLMs) driving the current AI boom – to autonomously discover and learn new skills for handling real-world tasks. Their findings, published in a preprint on arXiv, demonstrate a significant leap in the ability of AI agents to generalize to unseen tasks and websites.
The traditional approach to training AI agents involves meticulous manual creation of task instructions. This approach is not only labor-intensive and costly, but also severely limits the scope of tasks an agent can perform. PAE tackles this limitation by allowing agents to learn through self-directed exploration.
The system is composed of three key components:
-
Context-Aware Task Proposer: This module autonomously generates tasks for the agent, leveraging contextual information such as the name of a website or user demonstrations. For example, if the website is Amazon, the proposer might suggest tasks like “Find the price of a specific book” or “Add an item to your shopping cart.” The proposer’s context-awareness is crucial; it can determine feasible tasks within the constraints of a given environment, unlike systems that simply generate random instructions.
-
Agent Policy: This is the core of the AI agent, trained using reinforcement learning (RL). It tries to complete the tasks proposed by the Task Proposer. In the case of web navigation, its actions might include clicking links, typing in search terms, and scrolling. The agent’s actions are grounded in the real world through interaction with a simulated browser environment.
-
Autonomous Evaluator: This component assesses the success or failure of the agent’s attempts. It relies on visual information (screen captures) rather than relying on the hidden internal states of the agent or requiring human judgment. This image-based evaluation provides a robust reward signal for the RL training, guiding the agent’s policy toward improved performance.
The researchers validated PAE using challenging vision-based web navigation tasks. They tested their system on real-world and self-hosted websites, using over 100 domains ranging from Amazon to more specialized sites. The results demonstrated a remarkable improvement in zero-shot generalization. Compared to state-of-the-art open-source agents, PAE achieved more than a 30% relative improvement in zero-shot task success rates on unseen tasks and websites. On the WebVoyager benchmark, the absolute performance improved from 22.6% to 33%, significantly exceeding the performance of other systems that rely on predefined human-annotated tasks. The system even demonstrated the ability to leverage limited user demonstrations to further enhance its performance.
This work represents a significant advance in autonomous skill discovery for foundation model agents. The capability of PAE to improve zero-shot generalization to unseen tasks and websites showcases the potential of this approach to enable the development of more broadly capable and general-purpose AI agents in the future. The researchers have released their open-source checkpoints and code, encouraging further exploration and development within the AI community. The key to PAE’s success lies in its ability to efficiently leverage the strengths of foundation models as both skill proposers and evaluators, thus bootstrapping the agent’s capabilities and enabling a higher level of generalization.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.