How AI Is Learning to "Watch" Only What Matters: Meet EVA
In the world of artificial intelligence, processing video has long been a “drowning in data” problem. To understand a two-hour film or a complex security feed, most current AI models act as passive observers. They either attempt to ingest every single frame—which is prohibitively expensive—or they look at a “highlight reel” of uniformly sampled frames, which often results in the AI missing the most critical split-second of action.
A team of researchers from SenseTime Research has unveiled a new framework called EVA (Efficient Reinforcement Learning for End-to-End Video Agent) that changes the paradigm from passive watching to active investigation. Instead of just seeing what it’s given, EVA autonomously decides what to watch, when to watch it, and how much detail it needs to see.
The “Planning-Before-Perception” Shift
The core innovation of EVA is a “planning-before-perception” workflow. Traditional models look first and think later. EVA does the opposite: it reads a user’s question and formulates a plan before it ever touches the video file.
The researchers describe this as an iterative loop of “summary–plan–action–reflection.” To build an intuition for how this works, imagine you are asked to find the exact moment a soccer ball crosses the goal line in a ten-minute clip.
A traditional AI might look at one frame every ten seconds. If the goal happened at second 45, the AI might only see frames at second 40 and second 50, missing the event entirely. EVA, however, would first “think” that it needs to find a crowd celebration. It might scan the whole video at a very low resolution to find where the players start cheering. Once it identifies that window, it “zooms in,” requesting high-resolution, high-frame-rate data for just those few seconds to confirm the ball’s trajectory.
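The coarse-to-fine search described above can be sketched in a few lines. This is a toy illustration, not the paper's actual implementation: the "video" is just a list of per-second activity scores, and all function names (`coarse_scan`, `pick_window`, `fine_scan`) are invented for the example.

```python
# Toy sketch of EVA-style "planning-before-perception": a cheap low-resolution
# pass over the whole clip, then an expensive high-resolution pass over only
# the window the cheap pass flagged. Function names are illustrative.

def coarse_scan(video, stride=10):
    """Cheap pass: sample one low-resolution 'frame' every `stride` seconds."""
    return [(t, video[t]) for t in range(0, len(video), stride)]

def pick_window(samples, radius=10):
    """Plan: choose a time window around the most 'interesting' sample."""
    peak_t = max(samples, key=lambda s: s[1])[0]
    return max(0, peak_t - radius), peak_t + radius

def fine_scan(video, start, end):
    """Expensive pass: inspect every second inside the chosen window."""
    return max(range(start, min(end, len(video))), key=lambda t: video[t])

# Toy ten-minute clip: the goal happens at second 45, and the crowd's
# reaction at second 50 is loud enough for the coarse pass to notice.
video = [0.1] * 600
video[45] = 1.0   # the event itself, invisible to a stride-10 sampler
video[50] = 0.6   # lingering celebration, visible at seconds 0, 10, 20, ...

samples = coarse_scan(video)       # uniform sampling alone misses second 45
start, end = pick_window(samples)  # zoom in around the celebration
event_t = fine_scan(video, start, end)
print(event_t)  # → 45
```

Note that the uniform coarse pass alone would have reported second 50; only the follow-up fine pass recovers the true event at second 45, which is exactly the failure mode of fixed-stride sampling the article describes.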
Learning from Mistakes
Training an AI to be this selective is difficult. If the model is too aggressive, it misses details; if it’s too cautious, it wastes computational power. The researchers developed a three-stage training pipeline to strike this balance.
First, they used “Supervised Fine-Tuning” to give the model a basic understanding of how to use video-seeking tools. Next, they applied Kahneman–Tversky Optimization (KTO), a method that helps the model learn from its own failures. By looking at “trajectories” of reasoning that led to wrong answers—such as jumping to a conclusion without enough visual evidence—EVA learned to prefer strategies that actually work.
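A useful property of KTO is that it needs only binary "desirable / undesirable" labels rather than paired preferences, so failed reasoning trajectories can be fed in directly as negative examples. Here is a minimal sketch of how such training data might be assembled; the trajectory format and labels are assumptions for illustration, not the paper's data schema.

```python
# Sketch of KTO-style data construction: each reasoning trajectory is tagged
# desirable or undesirable based solely on whether it reached a correct
# answer. The record format here is invented for illustration.

def label_trajectories(trajectories):
    """Tag each (steps, answer_correct) pair for KTO-style training."""
    dataset = []
    for steps, correct in trajectories:
        dataset.append({
            "trajectory": steps,
            "label": "desirable" if correct else "undesirable",
        })
    return dataset

trajectories = [
    (["scan low-res", "zoom into 40-60s", "answer"], True),
    (["answer immediately"], False),  # jumped to a conclusion without evidence
]
data = label_trajectories(trajectories)
print([d["label"] for d in data])  # → ['desirable', 'undesirable']
```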
Finally, the team used Group Relative Policy Optimization (GRPO). This incentivized the model to be both accurate and efficient, essentially rewarding the AI for getting the right answer using the smallest number of "visual tokens" possible.
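An accuracy-plus-efficiency incentive of this kind can be sketched as a reward function: wrong answers earn nothing, and correct answers earn a bonus that grows as the token count shrinks. The weighting, the token budget, and the group-mean baseline below are illustrative assumptions, not the paper's exact reward.

```python
# Sketch of an efficiency-aware reward in the spirit of GRPO: correct answers
# score higher when they consumed fewer visual tokens, and each rollout is
# judged relative to its group's mean (simplified: no std normalization).
# All constants are illustrative assumptions.

def reward(correct, tokens_used, budget=700_000, efficiency_weight=0.5):
    """Return 0 for wrong answers; otherwise 1 plus a frugality bonus."""
    if not correct:
        return 0.0
    frugality = 1.0 - min(tokens_used, budget) / budget
    return 1.0 + efficiency_weight * frugality

def group_advantage(rewards):
    """GRPO compares each rollout to the mean reward of its group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# A right answer at ~10k tokens beats a right answer that burned the budget,
# and both beat a wrong answer.
rewards = [
    reward(True, 10_000),    # frugal and correct  → ≈ 1.49
    reward(True, 700_000),   # correct but wasteful → 1.0
    reward(False, 10_000),   # wrong                → 0.0
]
print(group_advantage(rewards))
```

The group-relative step is what puts the "group relative" in GRPO: rather than needing a learned value function as a baseline, each rollout's advantage is computed against the other rollouts sampled for the same question.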
Efficiency Meets Accuracy
The results are striking. In tests across six major video-understanding benchmarks, EVA outperformed traditional models by 6% to 12%. More impressively, it achieved these results while using significantly less data. On the “Sampling Dilemma” benchmark, EVA reached 51% accuracy using only about 10,000 visual tokens; for comparison, some top-tier closed-source models required nearly 700,000 tokens to achieve similar performance.
By transforming AI from a passive recipient of pixels into an active, strategic “watcher,” EVA paves the way for smarter, cheaper, and faster video analysis in everything from autonomous robotics to long-form film search. As the researchers put it, EVA isn’t just a recognizer—it’s an autonomous agent that knows how to look.