Open Web Voyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization
๐ Full Paper
๐ฌ Ask
The rapid development of large language models has spurred interest in using them to create autonomous agents that can handle real-world scenarios. However, current open-source efforts are limited by their reliance on text-only agents trained in synthetic environments with clearly defined rewards. These agents struggle to generalize to real-world settings that require multimodal perception and lack ground-truth signals.
A new paper introduces Open Web Voyager, an open-source framework designed to address these limitations. Open Web Voyager facilitates the development of multimodal web agents that can autonomously explore the real web, collect feedback, and improve their performance over time.
The framework operates in two phases:
- Imitation Learning: The agent first learns basic abilities by imitating the actions of a pre-trained multimodal agent called WebVoyager-40, which is based on GPT-40.
- Exploration-Feedback-Optimization Cycle: The agent is then allowed to explore the open web and collect feedback on its trajectories. This feedback is provided by another general-purpose language model, in this case, GPT-40, which evaluates the quality of the trajectories. The agent then uses this feedback to improve its policy by learning from well-performing trajectories. This exploration-feedback-optimization cycle can continue for multiple iterations, allowing the agent to progressively improve its performance.
For example, if the agent is tasked with finding the latest release of a specific software on GitHub, it may initially fail to find the correct information. However, after exploring the web and receiving feedback from GPT-40, it might learn that using a search engine can be more effective in such scenarios. This feedback helps the agent to improve its navigation strategies and avoid similar errors in the future.
To demonstrate the effectiveness of Open Web Voyager, the authors conducted experiments on several benchmark datasets, including WebVoyager and Mind2Web. The results showed that the agent significantly improved its performance across multiple test sets after each iteration of the exploration-feedback-optimization cycle. They also observed that the agent was able to generalize to some extent to unseen websites, suggesting that the approach has the potential to create agents that are more robust and adaptable to real-world environments.
Open Web Voyager addresses several key limitations of existing open-source web agents:
- Multimodal Perception: The agent can process both visual and textual information, making it more effective in real-world web environments where visual information is often crucial.
- Real-World Exploration: By exploring the actual web, the agent learns to handle the complexities of real-world scenarios, which are often difficult to replicate in synthetic environments.
- Iterative Improvement: The agentโs performance is continually improved through a feedback loop that allows it to learn from its mistakes and adapt to new situations.
Open Web Voyager is a promising approach for building autonomous multimodal web agents that can handle real-world tasks. As the framework is open-source, it is expected to encourage further research and development in this area, leading to the creation of more sophisticated and capable web agents that can assist humans in a variety of ways.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.