Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Researchers from Carnegie Mellon University, MIT, New York University, and Microsoft have created VideoWebArena, a benchmark for evaluating long-context multimodal agents on video-understanding web tasks. They found that existing models fall well short of human performance, highlighting the need to improve the agentic abilities of long-context multimodal models.

VideoWebArena (VideoWA) consists of 2,021 web agent tasks based on manually crafted video tutorials that total almost four hours of content. The benchmark focuses on two key areas: skill retention and factual retention.

VideoWA reuses the six thematic environments from VisualWebArena and WebArena: Reddit, Classifieds, Shopping, Shopping Admin, Map, and GitLab. The benchmark covers a wide variety of tasks; for example, “Buy the cheapest red blanket from Blankets & Throws” requires the agent to understand a video tutorial about online shopping and apply that knowledge to navigate the website and find the desired product.
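To make the task format concrete, here is a hypothetical sketch of how such a task might be specified, loosely following the JSON-style task configs used by WebArena and VisualWebArena. Every field name below is illustrative, not the benchmark's actual schema.

```python
# Hypothetical VideoWA-style task specification; all fields are illustrative.
task = {
    "task_id": 101,                     # made-up ID
    "site": "shopping",                 # one of the six environments
    "intent": "Buy the cheapest red blanket from Blankets & Throws.",
    # Each task is paired with a manually crafted tutorial video.
    "tutorial_video": "videos/shopping_sort_by_price.mp4",  # hypothetical path
    # Success is checked programmatically against the final page/order state.
    "eval": {"type": "program_check", "expected": "cheapest red blanket ordered"},
}
```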

VideoWA is designed to test two core agentic abilities:

  1. Skill retention: whether an agent can use a video tutorial to complete the task it demonstrates.
  2. Factual retention: whether an agent can recall task-relevant information from a video.

The authors evaluated several state-of-the-art video-capable LLMs, namely GPT-4o and Gemini 1.5 Pro, on VideoWA. They found that while these models can serve in a limited capacity as video-capable agents, they still fall far short of human-level performance.

For example, the best model achieved only 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance of 73.9% and 79.3%, respectively. This underscores the need for further research on long-context video agents.
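For context on how such success rates are measured, below is a minimal sketch of the WebArena-style agent loop that benchmarks in this family follow: the agent repeatedly observes the page, acts, and is scored programmatically at the end. Here `env`, `model`, and the observation and action formats are hypothetical stand-ins, not VideoWA's actual API.

```python
def run_episode(env, model, video_context, max_steps: int = 30) -> bool:
    """Run one VideoWA-style task; all APIs here are hypothetical stand-ins."""
    obs = env.reset()  # e.g. a screenshot or accessibility tree of the start page
    for _ in range(max_steps):
        # The model sees the video-derived context plus the live page observation
        # and returns a browser action such as click, type, or stop.
        action = model.act(context=video_context, observation=obs)
        if action == "stop":
            break
        obs = env.step(action)
    # Success is scored by a programmatic evaluator over the final state.
    return env.evaluate()
```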

The authors propose three baseline agents, which differ in how video content is provided to the model (see the sketch after this list):

  1. Video Summary Agent: a textual summary of the video is fed in-context.
  2. Video Frame Agent: a fixed number of sampled frames plus the audio transcript are fed in-context.
  3. Video Agent: the video itself is fed in-context.
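A minimal sketch of how these three context-construction strategies might differ in code, assuming hypothetical helpers; the function and field names are illustrative, not the benchmark's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VideoContext:
    parts: List[str]  # placeholder for text/image/video content blocks

def build_context(agent_type: str, video_path: str) -> VideoContext:
    """Assemble the video-derived context for one of the three baselines.

    All helpers and content placeholders below are hypothetical stand-ins.
    """
    if agent_type == "summary":
        # Video Summary Agent: only a textual summary of the tutorial
        # is placed in-context (generated offline, e.g. by an LLM).
        return VideoContext(parts=[f"<summary of {video_path}>"])
    if agent_type == "frames":
        # Video Frame Agent: a fixed number of sampled frames plus the
        # audio transcript are placed in-context.
        n_frames = 30  # illustrative frame budget
        frames = [f"<frame {i} of {video_path}>" for i in range(n_frames)]
        transcript = f"<audio transcript of {video_path}>"
        return VideoContext(parts=frames + [transcript])
    if agent_type == "video":
        # Video Agent: the raw video itself is placed in-context (only
        # possible for models that accept native video input).
        return VideoContext(parts=[f"<video file {video_path}>"])
    raise ValueError(f"unknown agent type: {agent_type}")

# Each turn, the agent would prepend this context to its usual web-agent
# observation before querying the model.
print(build_context("frames", "tutorial_042.mp4").parts[:2])
```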

The authors found that no single baseline agent or model family wins across both skill and factual retention tasks, pointing to the need for more research into robust and effective video-capable agents.

VideoWA is a valuable resource for researchers developing long-context video agents and provides a standardized way to measure progress in this area.