AI Papers Reader

Personalized digests of the latest AI research


AI Learns to "See" the Future: The New "Chain of Events" Paradigm for Video Prediction

Current Multimodal Large Language Models (MLLMs) are remarkably good at describing what is happening in a video, but they often stumble when asked a simple question: “What happens next?” A new research paper from Alibaba Group’s AMAP team introduces Video-CoE, a paradigm designed to bridge this gap by teaching AI to build a logical “Chain of Events” (CoE) before making a prediction.

The Guessing Game Problem

Predicting future events is more than just recognizing objects; it requires fine-grained temporal reasoning. The researchers found that state-of-the-art models, including GPT-4o, often perform poorly on Video Event Prediction (VEP).

The failure stems from two main issues. First, current models suffer from “textual bias”—they tend to look at the multiple-choice options provided in a test and guess the answer based on word patterns rather than actually watching the video. For example, if a video shows a surfer riding a wave and the options include “the surfer concludes their ride,” the AI might pick it simply because it sounds like a plausible ending, ignoring visual cues like a logo appearing that suggests a transition to a commercial.

Second, the study found that these models barely “pay attention” to visual tokens. Analysis of the models’ internal attention distributions showed they focus overwhelmingly on the text of the question and options, essentially treating the video as background noise.
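This kind of diagnosis can be sketched with a small helper: given one attention head's weights over a mixed token sequence, measure what fraction of the attention mass lands on visual tokens. The function and the toy numbers below are illustrative assumptions, not the paper's actual analysis code; real MLLMs interleave thousands of visual tokens with the text prompt.

```python
def visual_attention_share(attn, is_visual):
    """Fraction of attention mass on visual tokens.

    attn: list of per-query attention rows (each row sums to 1 over all keys).
    is_visual: per-key flags marking which positions hold video-frame tokens.
    """
    per_query = [sum(w for w, v in zip(row, is_visual) if v) for row in attn]
    return sum(per_query) / len(per_query)

# Toy example: 3 query tokens attending over 4 keys; keys 0-1 are visual.
attn = [
    [0.05, 0.05, 0.60, 0.30],
    [0.10, 0.10, 0.40, 0.40],
    [0.00, 0.10, 0.50, 0.40],
]
is_visual = [True, True, False, False]
share = visual_attention_share(attn, is_visual)
print(share)  # a low share means the video is effectively background noise
```

A share near zero is the "background noise" failure mode the study describes; the training approach below is designed to push this number up.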

Building the Chain

To fix this, Video-CoE introduces a “Chain of Events” requirement. Instead of jumping straight to a prediction, the model is trained to first segment the video into a chronological sequence of historical events.

Imagine a video of someone making a pasta salad. A standard AI might see the ingredients and guess “eating” is the next step. Video-CoE, however, would be forced to first identify:

  1. Event 1: Chopping onions.
  2. Event 2: Slicing bell peppers.
  3. Event 3: Boiling water.

By explicitly laying out this chain, the model builds a logical foundation. It “realizes” that because the vegetables are prepped but the dressing hasn’t been mixed, the next logical step is blending the vinegar and oil, not eating the finished meal.
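The intermediate structure the paradigm asks for can be sketched as a timestamped list of observed events that the prediction is conditioned on. The field names, timestamps, and prompt format below are illustrative assumptions; the paper's actual output schema may differ.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start_s: float   # event start time in seconds (hypothetical values)
    end_s: float     # event end time in seconds
    description: str

# The chain the model must emit *before* predicting what happens next.
chain = [
    Event(0.0, 5.0, "chopping onions"),
    Event(5.0, 11.0, "slicing bell peppers"),
    Event(11.0, 18.0, "boiling water"),
]

# The prediction is grounded in this chain rather than in the answer options.
context = " -> ".join(e.description for e in chain)
print(f"Observed chain: {context}")
```

Keeping timestamps on each event matters for the next stage: they are what lets the training signal check the chain against the actual video.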

Reinforcing Logic

The researchers implemented this with a two-stage training process. In the first stage, CoE-SFT (supervised fine-tuning) teaches the model to connect video content to future events through explicit chain-of-events reasoning.

The second stage, CoE-GRPO, uses reinforcement learning. The model is rewarded not just for getting the answer right, but for how well its “Chain of Events” aligns with the actual video. Specifically, the system uses a “similarity reward”—it crops the video based on the timestamps the AI provides and checks if the visual content actually matches the AI’s description. If the AI claims it saw “chopping onions” at the five-second mark, but the video shows a boiling pot, it loses points.
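A minimal sketch of such a similarity reward: crop the video at the timestamps the model emitted for each claimed event, embed the clip and the caption, and reward their agreement. Here `embed_clip` and `embed_text` are toy placeholders standing in for a real video-text encoder (e.g. a CLIP-style model), and the "video" is just a list of per-second activity labels; none of this is the paper's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def embed_text(caption):
    # Placeholder text encoder: a real system would use a learned embedding.
    return [1.0, 0.0] if "chop" in caption else [0.0, 1.0]

def embed_clip(video, start_s, end_s):
    # Placeholder clip encoder: looks up the activity label at the start time.
    return embed_text(video[int(start_s)])

def similarity_reward(video, predicted_events):
    """Mean visual-textual agreement over the model's claimed (start, end, caption) events."""
    scores = [cosine(embed_clip(video, s, e), embed_text(caption))
              for s, e, caption in predicted_events]
    return sum(scores) / len(scores)

# Toy video: chopping for 5 seconds, then a boiling pot. The model claims
# "chopping onions" at the five-second mark, but the clip there shows a
# boiling pot, so the reward is low.
video = ["chopping"] * 5 + ["boiling pot"] * 5
print(similarity_reward(video, [(5.0, 8.0, "chopping onions")]))  # 0.0
```

Because the reward is computed from the model's own timestamps, hallucinating events the video never shows directly lowers the score, which is exactly the pressure that pushes attention back onto the visual tokens.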

A New State-of-the-Art

The results are significant. On benchmarks like FutureBench and AVEP, Video-CoE outperformed both leading open-source models and commercial systems like GPT-4o. Just as importantly, the researchers showed that Video-CoE shifted the AI's focus: trained models devoted substantially more of their attention to visual tokens.

By forcing AI to show its work through a temporal chain, Video-CoE moves the field closer to “anticipatory AI”—systems that can move beyond simple observation to provide early warnings in crisis scenarios or assist in complex real-world decision-making.