Self-Evolving Mobile Assistant Outperforms State-of-the-Art in Complex Tasks
A new research paper introduces Mobile-Agent-E, a hierarchical multi-agent mobile assistant that significantly outperforms existing state-of-the-art (SOTA) approaches in tackling complex, real-world tasks on smartphones. The key innovation is its self-evolution module, which allows the agent to learn from past experiences and continuously improve its performance and efficiency.
Current mobile assistants often struggle with tasks requiring multiple app interactions and long-term planning. For example, imagine trying to find the cheapest Nintendo Switch Joy-Con across Amazon, Walmart, and Best Buy, stopping only when the cheapest option is ready to be added to your cart. This involves navigating multiple apps, comparing prices, and making decisions based on real-time information. Existing systems often fail in such scenarios due to limited reasoning capabilities and an inability to learn from previous attempts.
Mobile-Agent-E overcomes these limitations through a novel hierarchical framework. It separates high-level planning from low-level action execution, assigning a dedicated agent to each function (a sketch of how these agents might cooperate follows the list):
- Manager: Creates high-level plans, breaking down complex tasks into smaller, manageable subgoals.
- Perceptor: Extracts fine-grained visual information from the phone screen (text, icons, etc.).
- Operator: Executes actions based on the plan (e.g., tapping, swiping, typing).
- Action Reflector: Verifies action outcomes, detecting and flagging errors.
- Notetaker: Collects and aggregates important information throughout the task.
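To make the division of labor concrete, here is a minimal sketch of how one step of the agent loop might be wired together. Every name here (`phone`, `manager.plan`, `reflector.verify`, and so on) is an illustrative assumption, not the paper's actual API.

```python
# Hypothetical skeleton of the hierarchical agent loop.
# In practice each role would be backed by its own LLM/VLM call.
from dataclasses import dataclass, field

@dataclass
class TaskState:
    subgoals: list[str] = field(default_factory=list)  # Manager's current plan
    notes: list[str] = field(default_factory=list)     # Notetaker's aggregated info
    last_error: str | None = None                      # Reflector's latest finding

def run_task(instruction, phone, manager, perceptor, operator,
             reflector, notetaker, max_steps=50):
    state = TaskState()
    for _ in range(max_steps):
        screen = perceptor.parse(phone.screenshot())         # fine-grained text/icons
        state.subgoals = manager.plan(instruction, screen, state)
        if not state.subgoals:                               # Manager judges task done
            break
        action = operator.decide(state.subgoals[0], screen)  # tap / swipe / type
        phone.execute(action)
        outcome = reflector.verify(action, phone.screenshot())
        state.last_error = None if outcome.ok else outcome.reason
        state.notes += notetaker.extract(screen)             # keep prices, names, etc.
    return state
```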
The core innovation, however, is the self-evolution module. This module maintains a persistent long-term memory storing two types of information learned from past experiences (one possible data layout is sketched after the list):
- Tips: General guidance and lessons learned on effective interaction strategies (e.g., “If a search action fails, verify the input text and ensure the correct search bar is targeted before retrying”).
- Shortcuts: Reusable sequences of atomic operations for frequently occurring subroutines (e.g., a sequence of actions to search and enter text into a field).
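A natural way to picture this long-term memory is as two small typed stores. The field names below are guesses for illustration only; the paper does not prescribe this exact schema.

```python
# Hypothetical schema for the persistent long-term memory.
from dataclasses import dataclass

@dataclass
class Tip:
    text: str         # natural-language guidance, e.g. the retry advice above
    source_task: str  # task in which the lesson was learned

@dataclass
class Shortcut:
    name: str               # e.g. "search_and_enter"
    precondition: str       # when the shortcut can safely be invoked
    atomic_ops: list[dict]  # ordered low-level operations, e.g.
                            # [{"op": "tap", "target": "search_bar"},
                            #  {"op": "type", "text": "<query>"},
                            #  {"op": "press", "key": "enter"}]
```

Treating a Shortcut as a single callable action lets the agent collapse a multi-step subroutine into one decision, which plausibly underlies the efficiency gains discussed below.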
These Tips and Shortcuts are dynamically refined after each completed task, allowing Mobile-Agent-E to continuously learn and adapt. For instance, if the agent initially struggles to find a specific app, it may learn a Tip about alternative navigation methods and add it to its memory. If a certain sequence of actions is repeatedly used (like searching), it can be encapsulated as a Shortcut.
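In code, the post-task refinement could look like the following hedged sketch. The reflector objects and their `revise` method are illustrative stand-ins for whatever prompting scheme the authors use, not their implementation.

```python
# Hypothetical post-task self-evolution step.
def evolve_memory(memory, trajectory, tip_reflector, shortcut_reflector):
    """Refine persistent Tips and Shortcuts from a finished task trajectory."""
    # Distill errors, recoveries, and successes into general guidance.
    memory.tips = tip_reflector.revise(memory.tips, trajectory)
    # Promote action sequences that recurred in the trajectory into
    # reusable Shortcuts (e.g. the search-and-enter pattern above).
    memory.shortcuts = shortcut_reflector.revise(memory.shortcuts, trajectory)
    return memory
```

Because the memory persists across tasks, each new task starts from a richer set of Tips and Shortcuts than the last.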
The researchers evaluated Mobile-Agent-E on Mobile-Eval-E, a newly introduced benchmark comprising 25 complex, multi-app tasks. These tasks are substantially harder than those in existing benchmarks, requiring significantly more steps and covering a wider range of app interactions.
The results demonstrate a significant performance improvement. Mobile-Agent-E achieves a 22% absolute improvement in “Satisfaction Score” (a metric reflecting human judgment of task completion and user satisfaction) over SOTA approaches. Furthermore, enabling the self-evolution module leads to an additional 6.5% gain. The system also demonstrates improved efficiency, requiring fewer steps to achieve the same level of satisfaction.
The paper concludes by suggesting future directions, including improved shortcut generation, personalized strategies, and enhanced safety protocols to mitigate potential risks associated with autonomous agents. The development of Mobile-Agent-E represents a substantial advance in creating more robust and user-friendly mobile assistants that can handle the complexities of real-world mobile tasks.