AndroidLab: A Framework for Training and Evaluating Android Agents
The development of autonomous agents that can interact with real-world environments, such as mobile operating systems, is a significant research challenge. This paper introduces ANDROIDLAB, a systematic framework for training and evaluating Android agents. It addresses the limitations of existing benchmarks, which often rely on static environments or evaluate only closed-source models, and offers a more comprehensive, reproducible evaluation approach.
The core of ANDROIDLAB is a standard operational environment that mimics the interactions users have with Android devices. This environment features:
- Operation Modes: Two operation modes are defined: SoM (Set-of-Mark) for multimodal models that can process visual information and XML for text-only models. These modes ensure consistent actions and objects across different input formats, enabling fair comparisons.
- Action Space: A standardized action space lets agents interact with the environment through operations such as Tap, Swipe, Type, Long Press, Home, Back, and Finish (see the sketch after this list).
- Reproducible Benchmark: A comprehensive benchmark with 138 tasks across nine apps, including both operation and query tasks, provides a challenging evaluation platform.
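To make the interface concrete, here is a minimal Python sketch of such an action space. The class and field names (for example, `element` indices referring to SoM marks or XML nodes) are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Union

# Minimal encoding of an AndroidLab-style action space. Field names and
# conventions are illustrative assumptions, not the paper's exact schema.

@dataclass
class Tap:
    element: int            # index of the target UI element (SoM mark or XML node)

@dataclass
class Swipe:
    element: int
    direction: Literal["up", "down", "left", "right"]
    distance: Literal["short", "medium", "long"] = "medium"

@dataclass
class Type:
    text: str               # text to enter into the focused field

@dataclass
class LongPress:
    element: int

@dataclass
class Home: ...             # return to the launcher

@dataclass
class Back: ...             # press the system back button

@dataclass
class Finish:
    answer: Optional[str] = None  # final answer for query tasks, None otherwise

Action = Union[Tap, Swipe, Type, LongPress, Home, Back, Finish]
```

Because both SoM and XML modes emit the same action values, trajectories from multimodal and text-only models remain directly comparable.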
To address the lack of open-source training data for Android agents, the paper presents Android Instruct, a dataset of 94.3k operation records. The dataset supports both text-only and multimodal training and is used to fine-tune six open-source models, yielding substantial performance improvements; a hypothetical record layout is sketched below.
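As a rough illustration of what one operation record might contain, consider the following layout. The field names here are assumptions for exposition; the released dataset may be structured differently.

```python
# Hypothetical layout of one Android Instruct-style operation record.
# Field names are illustrative, not the dataset's actual schema.
record = {
    "task": "Set an alarm for 7:30 am in the Clock app",
    "step": 3,
    "observation": {
        "xml": "<node class='android.widget.Button' ...>",  # compressed UI tree (XML mode)
        "screenshot": "step_3.png",                         # annotated image (SoM mode)
    },
    "action": {"type": "tap", "element": 12},
}
```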
The evaluation metrics used in ANDROIDLAB are designed to assess complementary aspects of agent performance (a toy computation over all four follows the list):
- Success Rate: Measures the percentage of tasks successfully completed.
- Sub-Goal Success Rate: Evaluates the completion rate of sub-goals within a task.
- Reversed Redundancy Ratio: Measures efficiency by comparing the number of operations an agent takes against a human demonstration of the same task; the ratio is reversed so that higher values indicate less redundant behavior.
- Reasonable Operation Ratio: Evaluates the effectiveness of actions by measuring the proportion of operations that result in visible screen changes.
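A toy aggregation of these four metrics could look like the sketch below. The episode-log fields are assumed for illustration; the actual benchmark derives these quantities from device and UI state rather than from a pre-built log.

```python
def evaluate(episodes):
    """Toy aggregation of the four AndroidLab-style metrics.

    Each episode dict uses an assumed layout: whether the task succeeded,
    how many sub-goals were hit, how many steps the agent and a human took,
    and how many agent actions visibly changed the screen.
    """
    n = len(episodes)
    success_rate = sum(e["success"] for e in episodes) / n
    sub_goal_rate = sum(
        e["sub_goals_completed"] / e["sub_goals_total"] for e in episodes
    ) / n
    # Reversed so that higher is better: an agent matching the human
    # demonstration scores 1.0, a wandering agent scores below it.
    reversed_redundancy = sum(
        e["human_steps"] / e["agent_steps"] for e in episodes
    ) / n
    reasonable_ops = sum(
        e["screen_changing_steps"] / e["agent_steps"] for e in episodes
    ) / n
    return success_rate, sub_goal_rate, reversed_redundancy, reasonable_ops
```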
The results show that fine-tuning open-source models on Android Instruct markedly improves performance across these metrics, lifting average success rates from 4.59% to 21.50% for text-only models and from 1.93% to 13.28% for multimodal models, with the best fine-tuned models rivaling closed-source ones on some metrics. The paper highlights the potential of open-source models for building effective Android agents, particularly when combined with a systematic framework like ANDROIDLAB.
The authors also examine how agent frameworks such as ReAct and SeeAct affect model performance. Their findings suggest that ReAct is particularly effective in XML mode, while SeeAct does not consistently improve performance; a minimal ReAct-style step is sketched below.
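For intuition, one ReAct-style step in XML mode might look like the following. The prompt wording and the crude parsing are placeholders rather than AndroidLab's actual template.

```python
def react_step(llm, observation, history):
    """One ReAct-style step: the model writes a Thought, then an Action.

    `llm` is any callable mapping a prompt string to a completion string;
    the prompt format below is an illustrative assumption.
    """
    prompt = (
        "You are an Android agent.\n"
        f"Previous steps:\n{history}\n"
        f"Current screen (XML):\n{observation}\n"
        "Reply with 'Thought: <reasoning>' followed by "
        "'Action: <Tap|Swipe|Type|Long Press|Home|Back|Finish>(...)'."
    )
    response = llm(prompt)
    thought, _, action = response.partition("Action:")
    return thought.removeprefix("Thought:").strip(), action.strip()
```

Interleaving an explicit reasoning step with each action in this way is what distinguishes ReAct from a plain observation-to-action policy, and plausibly explains its advantage in the verbose XML mode.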
Overall, ANDROIDLAB is a valuable contribution to the field of mobile agent research. It provides a standardized environment and benchmark for training and evaluating Android agents, addressing the limitations of existing solutions. The open-source nature of the framework and dataset allows for wider adoption and collaboration within the research community.