AI Papers Reader

Personalized digests of latest AI research

STEVE: AI Agent Training Pipeline Uses Step Verification for Enhanced Computer Interaction

A new approach called STEVE aims to improve how AI agents interact with computer graphical user interfaces (GUIs) the way humans do. STEVE, short for Step Verification, tackles the challenge of training agents to perform complex tasks on digital devices by verifying the correctness of each action the agent takes.

Traditionally, training AI agents to use computers requires vast amounts of high-quality, manually annotated data, which is expensive and time-consuming to collect. STEVE addresses this by automating the verification process with a large vision-language model (VLM), specifically GPT-4o, which judges the success of each step the agent takes.

Here’s how STEVE works:

  1. UI Grounding Model Training: The system starts by training a VLM to understand GUI elements, using a large dataset of web pages and desktop screenshots. For example, the model learns to recognize and locate buttons, text boxes, and other UI components (a sketch of one such training record appears after this list).

  2. Trajectory Data Collection: An AI agent with limited capabilities is deployed in a live Windows environment to perform tasks. This agent generates a series of actions, creating a “trajectory” for each task. For example, if the task is to “search the latest news on AI”, the trajectory might include actions like “click the Chrome icon,” “type ‘latest news on AI’ into the search bar,” and “press Enter.”

  3. Step Verification with GPT-4o: GPT-4o acts as a judge, evaluating each action in the trajectory. By comparing screenshots taken before and after the action, GPT-4o determines whether the action was correct and helped complete the task. Each step is assigned a binary label indicating success or failure (see the verification sketch after this list).

  4. Kahneman-Tversky Optimization (KTO): STEVE then optimizes the agent with Kahneman-Tversky Optimization (KTO), an algorithm that leverages both positive and negative examples (i.e., successful and unsuccessful steps) to train the agent more effectively. In the “latest news on AI” example, if the agent misclicks a link, GPT-4o labels that step as negative, and KTO helps the agent learn to avoid such mistakes in the future (see the KTO data sketch after this list).
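
The paper does not spell out the schema of the grounding data used in step 1, so the record below is only an illustrative guess at what one UI-grounding training example might look like: a screenshot, a referring expression, and the pixel location of the target element.

```python
# Hypothetical UI-grounding training sample: the model learns to map a
# referring expression to the location of the element on the screenshot.
grounding_sample = {
    "image": "screenshots/settings_page_0142.png",
    "instruction": "click the 'Check for updates' button",
    "bbox": [412, 268, 598, 302],  # target element, pixel coordinates (x1, y1, x2, y2)
}
```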
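
For step 3, the paper's exact judge prompt and parsing are not reproduced here; the snippet below is a minimal sketch of the step-verification idea using the OpenAI Python client, comparing a before/after screenshot pair. The prompt wording, function names, and the SUCCESS/FAILURE convention are illustrative assumptions, not the authors' implementation.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are judging one step of a GUI agent working on the task: '{task}'.\n"
    "The agent executed the action: {action}.\n"
    "The first image shows the screen before the action, the second the screen after.\n"
    "Answer with a single word: SUCCESS if the action made correct progress "
    "toward the task, otherwise FAILURE."
)

def encode_image(path: str) -> str:
    """Read a screenshot from disk and return it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{b64}"

def verify_step(task: str, action: str, before_png: str, after_png: str) -> bool:
    """Ask a GPT-4o judge whether a single agent step was successful."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": JUDGE_PROMPT.format(task=task, action=action)},
                {"type": "image_url", "image_url": {"url": encode_image(before_png)}},
                {"type": "image_url", "image_url": {"url": encode_image(after_png)}},
            ],
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("SUCCESS")  # binary step label
```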
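
For step 4, KTO works with unpaired examples that carry only a binary desirability label, so the verified steps map naturally onto a preference dataset. The conversion below is a minimal sketch under assumed field names (prompt, completion, label, the format expected by common KTO implementations such as TRL's KTOTrainer); the VerifiedStep structure and prompt format are illustrative, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class VerifiedStep:
    task: str          # natural-language task instruction
    screenshot: str    # path to the screen observation before the action
    action: str        # action the agent emitted, e.g. "click(312, 88)"
    success: bool      # binary label produced by the GPT-4o judge

def to_kto_records(steps: list[VerifiedStep]) -> list[dict]:
    """Turn verified steps into unpaired KTO examples.

    Unlike DPO, KTO does not need matched chosen/rejected pairs: every
    step becomes one example whose label says whether the completion
    (the action) is desirable given its context (the prompt).
    """
    records = []
    for step in steps:
        records.append({
            "prompt": f"Task: {step.task}\nObservation: <image:{step.screenshot}>\nNext action:",
            "completion": step.action,
            "label": step.success,  # True = desirable, False = undesirable
        })
    return records
```

A KTO-style trainer then raises the likelihood of actions labelled desirable and lowers that of undesirable ones relative to a reference model, which is how the misclick in the example above becomes useful training signal rather than discarded data.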

The researchers conducted extensive experiments and found that agents trained with STEVE outperform agents trained with conventional supervised finetuning. In particular, STEVE can train a 7B vision-language model that achieves leading performance in the WinAgentArena live desktop environment efficiently and at reduced cost. Because the trained agents learn from both positive and negative samples, they interact with GUIs more reliably while suffering less degradation in UI localization. STEVE reached a 22% task success rate on WinAgentArena, surpassing state-of-the-art results.