
'Look Before You Leap': AI Model Diagnoses Errors in GUI Automation Before They Happen

Multimodal Large Language Models (MLLMs) are increasingly being used for automating interactions with graphical user interfaces (GUIs). However, errors in GUI automation can have cumulative and sometimes irreversible consequences, such as accidental file deletions or incorrect payments. To mitigate these risks, researchers have developed a new “pre-operative critic” model called GUI-Critic-R1 that provides feedback on proposed actions before they are executed.

The GUI-Critic-R1 model, detailed in a recent paper, aims to improve the reliability and efficiency of GUI automation by identifying potential errors and suggesting corrective measures. It does so by analyzing the current screen state and the proposed action, reasoning about potential outcomes, and providing feedback to the MLLM-driven agent.
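
To make that control flow concrete, here is a minimal Python sketch of how a pre-operative critic could gate an agent's actions. The `agent`, `critic`, and `execute` objects, along with the critique fields, are illustrative assumptions rather than the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    action_is_correct: bool  # the critic's verdict on the proposed action
    reasoning: str           # explanation of why it is (in)correct
    suggestion: str          # corrective hint when the action is wrong

def gated_step(agent, critic, screenshot, instruction, execute):
    """One GUI step, gated by a pre-operative critique.

    `agent`, `critic`, and `execute` are stand-ins for an MLLM agent,
    a critic model like GUI-Critic-R1, and a device controller.
    """
    action = agent.propose_action(screenshot, instruction)
    critique = critic.critique(screenshot, instruction, action)
    if critique.action_is_correct:
        return execute(action)  # verdict: safe to act on the real GUI
    # Feed the critique back so the agent can re-plan before any
    # irreversible operation (e.g., an accidental deletion) happens.
    action = agent.propose_action(screenshot, instruction,
                                  feedback=critique.suggestion)
    return execute(action)
```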

For instance, imagine a scenario where a user instructs the agent to rename an audio file, but the agent mistakenly selects the “delete” button instead. GUI-Critic-R1 would catch the error before execution, preventing the file deletion, and would return feedback explaining the mistake, improving the agent’s subsequent decision-making.

A key innovation is the “Suggestion-aware Gradient Relative Policy Optimization” (S-GRPO) training strategy, which incorporates a “suggestion reward” that ties the training signal to the reliability of the model’s feedback, encouraging accurate and actionable suggestions for correcting errors.
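
As a rough illustration of the idea, the sketch below composes a per-sample reward (format, verdict correctness, and a suggestion term) and normalizes it within a sampled group, in the GRPO style of group-relative advantages. The reward decomposition and weights are assumptions for illustration only; the paper defines the exact S-GRPO objective.

```python
import numpy as np

def s_grpo_advantages(samples, gt_verdict, gt_suggestions):
    """Group-relative advantages with a suggestion reward (illustrative).

    `samples` is a group of critiques sampled for one training example;
    each has `is_well_formatted`, `verdict`, and `suggestion` attributes.
    The reward terms and weights below are assumptions, not the paper's.
    """
    rewards = []
    for s in samples:
        r = 0.0
        r += 0.5 if s.is_well_formatted else 0.0      # format reward
        r += 1.0 if s.verdict == gt_verdict else 0.0  # verdict correctness
        # Suggestion reward: when the action is truly wrong, the proposed
        # correction must match a reference suggestion to earn credit.
        if gt_verdict == "incorrect" and s.suggestion in gt_suggestions:
            r += 1.0
        rewards.append(r)
    rewards = np.array(rewards)
    # Group-relative normalization: each sample's advantage is its reward
    # standardized against the group's mean and standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```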

To train and test GUI-Critic-R1, the researchers created two new datasets, GUI-Critic-Train and GUI-Critic-Test, which fill existing gaps in GUI critic data and cover a variety of mobile and web GUI scenarios. GUI-Critic-Train contains chain-of-thought data in which the model analyzes GUI actions and explains why they are correct or incorrect. GUI-Critic-Test evaluates the pre-operative critic’s ability to diagnose errors on unseen instructions and applications.
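
A training example might look roughly like the following Python record. The field names are hypothetical, but the structure reflects the paper's description of chain-of-thought critiques that judge a proposed action and explain the verdict.

```python
# A hypothetical GUI-Critic-Train record. Field names are illustrative
# assumptions; the dataset pairs a screen, an instruction, and a proposed
# action with a chain-of-thought critique of that action.
record = {
    "screenshot": "screens/audio_app_0142.png",
    "instruction": "Rename the audio file 'memo.m4a' to 'meeting.m4a'",
    "proposed_action": {"type": "click", "target": "Delete button"},
    "critique": {
        "reasoning": "The goal is renaming, but the targeted control "
                     "deletes the file, which is irreversible.",
        "verdict": "incorrect",
        "suggestion": "Open the file's context menu and choose Rename.",
    },
}
```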

Experiments on the GUI-Critic-Test dataset showed that GUI-Critic-R1 significantly outperformed existing MLLMs in terms of critic accuracy. When integrated into a GUI automation benchmark called AndroidWorld, the GUI-Critic-R1 improved success rates and operational efficiency, indicating its effectiveness in real-world scenarios.

For example, in the AndroidWorld benchmark, GUI-Critic-R1 raised the success rate of GUI automation from 22.4% to 27.6%. Furthermore, an analysis of task execution showed that the critic-guided agent completed tasks in fewer steps, improving operational efficiency.

The researchers found that the model not only avoids potential mistakes but also identifies more efficient paths for completing instructions: for example, rather than opening Bluetooth through the Settings app, it can suggest enabling it via the control center.

The GUI-Critic-R1 model represents a significant step forward in GUI automation, offering a proactive approach to error diagnosis and correction. By providing feedback before actions are executed, it can prevent potentially disruptive outcomes and enhance the overall reliability and efficiency of GUI agents. The code for the GUI-Critic-R1 is available on GitHub at https://github.com/X-PLUG/MobileAgent/tree/main/GUI-Critic-R1.