
MobiAgent: A New Framework for Smarter Mobile Assistants

Shanghai, China – Researchers at Shanghai Jiao Tong University have unveiled MobiAgent, a novel framework designed to significantly enhance the capabilities of mobile agents. These intelligent assistants, powered by advanced Vision-Language Models (VLMs), can now perform real-world tasks on smartphones with greater accuracy and efficiency than previous systems.

MobiAgent addresses key limitations of existing mobile agents, which often struggle with low task completion rates, slow response times, and poor handling of unexpected situations. The system comprises three core components: the MobiMind-series agent models, the AgentRR acceleration framework, and the MobiFlow benchmarking suite.

At the heart of MobiAgent are the MobiMind models, which feature a modular design that separates task planning, decision-making, and execution. This separation allows the agent to integrate with different backend operation modes, such as graphical user interfaces (GUIs) and XML data.
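To make that division of labor concrete, the sketch below shows one way such a planner/decider/grounder split might look in code. The class and method names here are illustrative assumptions, not the actual MobiMind interfaces.

```python
# A minimal sketch of a modular mobile-agent loop, assuming a planner/decider/
# grounder split. All names are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes   # raw GUI screenshot of the current screen
    view_xml: str       # XML view hierarchy, if the backend exposes one

class Planner:
    def next_subgoal(self, task: str, history: list[str]) -> str:
        """Break the user task into the next high-level subgoal."""
        raise NotImplementedError  # backed by a VLM in practice

class Decider:
    def decide(self, subgoal: str, obs: Observation) -> str:
        """Choose an abstract action, e.g. 'tap the search box'."""
        raise NotImplementedError

class Grounder:
    def ground(self, action: str, obs: Observation) -> dict:
        """Map the abstract action to concrete screen coordinates or an XML node."""
        raise NotImplementedError

def run_step(task, history, obs, planner, decider, grounder):
    subgoal = planner.next_subgoal(task, history)
    action = decider.decide(subgoal, obs)
    return grounder.ground(action, obs)  # executed by the device driver
```

Because each stage is a separate component, the grounding step can target either pixel coordinates from a screenshot or elements from an XML hierarchy without changing the planner or decider.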

To tackle the challenge of slow response times, the AgentRR acceleration framework plays a crucial role. AgentRR records and abstracts past task executions into “experiences.” A lightweight memory model then determines whether a past experience can be reused, significantly reducing the computational load on the agent models. Imagine an agent booking a train ticket multiple times; with AgentRR, it can learn from previous successful bookings, quickly identify the necessary steps like entering the destination and dates, and skip redundant reasoning.
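The snippet below is a toy illustration of this record-and-replay idea: it caches abstracted action sequences from past tasks and reuses them when a new task looks similar enough. The string-similarity matcher is only a stand-in for AgentRR's lightweight memory model, and all names are assumptions for illustration.

```python
# A toy record-and-replay cache in the spirit of the AgentRR idea.
# The similarity heuristic and data layout are assumptions, not the paper's design.
from difflib import SequenceMatcher

class ExperienceStore:
    def __init__(self, threshold: float = 0.85):
        self.experiences = {}        # task description -> abstracted action list
        self.threshold = threshold   # similarity needed to reuse an experience

    def record(self, task: str, actions: list[str]) -> None:
        self.experiences[task] = actions

    def lookup(self, task: str) -> list[str] | None:
        """Return a cached action sequence if a past task is similar enough;
        otherwise the agent falls back to full model reasoning."""
        best_task, best_score = None, 0.0
        for past in self.experiences:
            score = SequenceMatcher(None, task, past).ratio()
            if score > best_score:
                best_task, best_score = past, score
        return self.experiences[best_task] if best_score >= self.threshold else None

store = ExperienceStore()
store.record("book train Shanghai to Beijing tomorrow",
             ["open app", "enter origin", "enter destination", "pick date", "confirm"])
cached = store.lookup("book train Shanghai to Beijing next Friday")
# If cached is not None, it serves as a plan skeleton; the agent only
# re-verifies each step instead of reasoning from scratch.
```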

A significant hurdle for developing capable agents has been the lack of high-quality, real-world task data. MobiAgent introduces an AI-assisted data collection pipeline that streamlines the process of gathering and annotating this data, drastically reducing manual effort. This pipeline collects detailed user interactions, including clicks, text inputs, and swipes, and even reconstructs the agent’s reasoning process using a VLM.
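A record from such a pipeline might look roughly like the following. The field names are hypothetical, but they capture the kinds of signals described: the action taken, the screen state, and a reasoning trace reconstructed afterwards by a VLM.

```python
# A sketch of one possible trajectory schema for collected mobile interactions.
# Field names are illustrative assumptions, not the paper's actual format.
from dataclasses import dataclass, field

@dataclass
class InteractionStep:
    action_type: str              # "click" | "input_text" | "swipe"
    target: dict                  # e.g. {"x": 540, "y": 1200} or an XML node reference
    text: str | None = None       # typed text for input actions
    screenshot_path: str = ""     # screen state captured before the action
    rationale: str = ""           # reasoning reconstructed afterwards by a VLM

@dataclass
class Trajectory:
    task: str
    app: str
    steps: list[InteractionStep] = field(default_factory=list)

traj = Trajectory(task="order a coffee for pickup", app="some_coffee_app")
traj.steps.append(InteractionStep(
    action_type="click",
    target={"x": 540, "y": 1200},
    screenshot_path="step_000.png",
    rationale="The home screen shows an 'Order' button; tapping it opens the menu.",
))
```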

To rigorously evaluate these advancements, the researchers developed MobiFlow, a sophisticated benchmarking framework. Unlike existing benchmarks that can be overly simplistic, MobiFlow uses Directed Acyclic Graphs (DAGs) to model the complex dependencies and sequential constraints found in real-world mobile application tasks. This allows for a more accurate and fine-grained assessment of agent performance, even when multiple correct ways exist to complete a task. For instance, searching for a product online might involve different sequences of clicks and text entries, all of which MobiFlow can recognize as valid completions.
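As a rough illustration of how a DAG-based check can accept multiple valid orderings, the sketch below treats milestones as nodes and "must happen before" constraints as edges; any action trace that respects the edges and reaches the terminal milestone passes. The milestone names and the checking rule are assumptions, not MobiFlow's actual format.

```python
# A minimal sketch of DAG-style task checking: milestones are nodes, edges
# encode ordering constraints, and any trace satisfying them counts as success.
def satisfies_dag(trace: list[str], edges: list[tuple[str, str]], terminal: str) -> bool:
    position = {}
    for i, event in enumerate(trace):
        position.setdefault(event, i)        # index of each milestone's first occurrence
    if terminal not in position:
        return False                         # the task was never completed
    for before, after in edges:
        if after in position:
            # every reached milestone must have its prerequisite earlier in the trace
            if before not in position or position[before] > position[after]:
                return False
    return True

# "Search for a product and add it to the cart": entering the query and setting
# a filter may happen in either order, as long as both precede the results page.
edges = [
    ("open_app", "enter_query"),
    ("open_app", "set_filter"),
    ("enter_query", "view_results"),
    ("set_filter", "view_results"),
    ("view_results", "add_to_cart"),
]
trace_a = ["open_app", "enter_query", "set_filter", "view_results", "add_to_cart"]
trace_b = ["open_app", "set_filter", "enter_query", "view_results", "add_to_cart"]
assert satisfies_dag(trace_a, edges, "add_to_cart")   # both orderings are accepted
assert satisfies_dag(trace_b, edges, "add_to_cart")
```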

Experimental results on the MobiFlow benchmark demonstrate that MobiAgent, specifically the combination of MobiMind-Decider-7B and MobiMind-Grounder-3B, outperforms both general-purpose LLMs like GPT-5 and Gemini 2.5 Pro and other specialized mobile agent models. MobiAgent follows instructions more faithfully, generates more insightful reasoning, and terminates tasks more reliably, even in complex scenarios like online shopping or food delivery.

The MobiAgent system represents a significant step forward in creating truly intelligent and practical mobile assistants, making our interactions with smartphones more efficient and intuitive.