InfiGUIAgent: A Multimodal GUI Agent That Learns to Reason
A team of researchers from Zhejiang University and other institutions has developed InfiGUIAgent, a novel multimodal graphical user interface (GUI) agent that significantly improves upon existing agents by incorporating native reasoning capabilities. Their findings, published in a recent preprint, demonstrate InfiGUIAgent’s superior performance on several benchmark tasks.
Current GUI agents, powered by large language models (LLMs), often struggle with multi-step reasoning and rely heavily on textual annotations of the GUI, limiting their effectiveness and generalizability. InfiGUIAgent addresses these shortcomings through a two-stage supervised fine-tuning (SFT) process.
Stage 1: Building Fundamental Skills
The first stage focuses on strengthening foundational skills. The researchers trained InfiGUIAgent on a diverse set of datasets covering:
- GUI Understanding: Datasets like Screen2Words and Screen Annotation help the agent learn to recognize and interpret GUI elements, such as buttons, text fields, and icons. For example, the agent learns to identify the “back” button in an Android settings menu and understand its function.
- Grounding: Datasets such as GUIEnv and RICO Semantic Annotation teach the agent to connect visual elements with their corresponding actions. This means the agent learns that clicking a specific button will trigger a particular response, such as opening a new contact screen.
- Question Answering: Datasets like ScreenQA train the agent to understand and respond to questions about the GUI’s contents. For example, it can answer questions like “What day is it tomorrow?” based on the date shown on the screen. (A sketch of what such training samples might look like follows this list.)
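To make the Stage 1 skills concrete, here is a minimal Python sketch of what grounding and question-answering training samples could look like. The field names, file paths, and coordinates are illustrative assumptions, not the actual schema of the datasets named above.

```python
# Hypothetical sketch of Stage 1 training samples; field names, paths, and
# coordinates are illustrative assumptions, not the datasets' real schema.

# A grounding-style sample: link an instruction to a target element and action.
grounding_sample = {
    "image": "screenshots/settings_home.png",            # GUI screenshot (placeholder path)
    "instruction": "Tap the 'Back' button",
    "target_bbox": [0.02, 0.03, 0.10, 0.08],             # normalized [x1, y1, x2, y2]
    "action": {"type": "click", "point": [0.06, 0.055]},
}

# A question-answering sample: answer a question from on-screen content.
qa_sample = {
    "image": "screenshots/calendar.png",
    "question": "What day is it tomorrow?",
    "answer": "Wednesday",
}

if __name__ == "__main__":
    for sample in (grounding_sample, qa_sample):
        print(sample)
```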
Stage 2: Integrating Advanced Reasoning
The second stage integrates two crucial reasoning skills:
- Hierarchical Reasoning: This allows the agent to break down complex tasks into smaller, more manageable subtasks. Imagine the task of creating a new contact: InfiGUIAgent first navigates to the “Contacts” section, then locates and clicks the “Create New Contact” button, a clear demonstration of hierarchical reasoning.
- Expectation-Reflection Reasoning: This enables the agent to learn from its actions by comparing expected outcomes with actual results. If an action fails, the agent can reflect on why it failed and adjust its strategy accordingly. For example, when replying to a message, the agent first predicts what should happen after it clicks the “Start Chat” button; if the expected result does not occur, it learns from that failure. (A minimal sketch of how these two skills fit into an agent loop follows this list.)
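The sketch below shows one way the two skills could fit together in a single agent loop, assuming hypothetical helpers (`plan_subtasks`, `propose_action`, `describe_screen`) that stand in for model and device calls. It illustrates the idea, not the paper’s actual implementation.

```python
# Hypothetical sketch of an agent loop combining hierarchical reasoning
# (decompose the goal into subtasks) with expectation-reflection (predict an
# action's outcome, then compare it with the next observation). The helper
# functions are illustrative stand-ins, not the paper's API.
from dataclasses import dataclass


@dataclass
class StepRecord:
    action: str
    expectation: str
    outcome: str
    reflection: str


def plan_subtasks(goal: str) -> list[str]:
    # Placeholder decomposition; a real agent would query the multimodal model.
    return [f"navigate toward: {goal}", f"complete: {goal}"]


def propose_action(subtask: str, screen: str) -> tuple[str, str]:
    # Returns (action, expected outcome); stubbed for the sketch.
    return (f"click the element relevant to '{subtask}'",
            f"the screen updates to advance '{subtask}'")


def describe_screen() -> str:
    # Stand-in for taking a screenshot and describing it.
    return "description of the current screen"


def run_agent(goal: str) -> list[StepRecord]:
    history: list[StepRecord] = []
    for subtask in plan_subtasks(goal):        # hierarchical reasoning
        screen = describe_screen()
        action, expectation = propose_action(subtask, screen)
        # ... the action would be executed on the device here ...
        outcome = describe_screen()            # observe what actually happened
        # Expectation-reflection: compare the prediction with the observation.
        matched = expectation in outcome       # placeholder check; a real agent
                                               # would ask the model to judge this
        reflection = ("as expected" if matched
                      else "unexpected result; revise the next action")
        history.append(StepRecord(action, expectation, outcome, reflection))
    return history


if __name__ == "__main__":
    for step in run_agent("create a new contact"):
        print(step)
```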
For Stage 2, the researchers used LLMs to synthesize new training data that integrates these reasoning skills, enriching the dataset with more complex, multi-step scenarios.
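As an illustration of what such synthesis could look like, the sketch below builds a prompt asking an LLM to add a reflection, the next subtask, and the next action with its expected outcome to an existing trajectory step. The prompt wording and fields are assumptions for illustration, not the paper’s actual pipeline.

```python
# Hypothetical sketch of reasoning-focused data synthesis: given one step from
# an existing trajectory, build a prompt asking an LLM to annotate it with a
# reflection, the next subtask, and the next action plus expected outcome.
# The prompt wording and field names are assumptions, not the paper's pipeline.

def build_synthesis_prompt(goal: str, previous_action: str, observation: str) -> str:
    return (
        "You are annotating a GUI interaction trajectory.\n"
        f"Overall goal: {goal}\n"
        f"Previous action: {previous_action}\n"
        f"Current screen: {observation}\n"
        "Write, in order: (1) a reflection on whether the previous action met "
        "expectations, (2) the next subtask, and (3) the next low-level action "
        "together with its expected outcome."
    )


if __name__ == "__main__":
    prompt = build_synthesis_prompt(
        goal="Reply to the latest message from Alice",
        previous_action="clicked the 'Start Chat' button",
        observation="the chat list screen is still shown",
    )
    # The prompt would then be sent to an LLM and the response appended to the
    # training sample as reasoning annotations.
    print(prompt)
```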
Results and Implications
InfiGUIAgent achieved state-of-the-art results on several benchmark datasets, including ScreenSpot and AndroidWorld, significantly outperforming other open-source models, even those with substantially larger parameter counts. These results highlight the importance of incorporating native reasoning capabilities for building truly effective and robust GUI agents.
The success of InfiGUIAgent represents a significant advance in the field of AI and could have far-reaching implications for automated task completion across various devices and applications. It also points to the potential of LLMs and future multimodal models to support more sophisticated reasoning processes. The researchers’ approach provides a blueprint for future research and development in this rapidly evolving area.