RLVR-World: Reinforcement Learning Supercharges World Models
In a significant leap for artificial intelligence, researchers at Tsinghua University have developed a new framework called RLVR-World that leverages reinforcement learning to train more accurate and useful world models. World models are AI systems that predict how environments will change in response to actions, enabling autonomous agents to plan and reason.
The traditional approach to training world models, known as maximum likelihood estimation (MLE), often falls short of producing models that truly align with real-world tasks. For instance, training video models with MLE can result in blurry predictions, while language models can suffer from repetition and hallucination.
RLVR-World addresses these limitations by using reinforcement learning with verifiable rewards (RLVR). Rather than training the model only to maximize the likelihood of the next step in a sequence, RLVR-World directly optimizes it for verifiable, task-relevant metrics such as prediction accuracy or perceptual quality.
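To make the idea concrete, here is a minimal, self-contained sketch of what an RLVR-style update could look like. It is an illustration rather than the authors' training code: the toy world model, the token-match reward, and the group-baseline REINFORCE update are all assumptions made for clarity.

```python
# Sketch of the RLVR idea: instead of maximizing next-token likelihood,
# sample predictions, score them with a verifiable reward, and push up the
# probability of above-average samples. All components here are illustrative.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, GROUP = 16, 4, 8

class ToyWorldModel(nn.Module):
    """Stand-in for an autoregressive world model: maps a state embedding
    to logits over the tokens describing the next state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(VOCAB, SEQ_LEN * VOCAB)

    def forward(self, state):
        return self.net(state).view(-1, SEQ_LEN, VOCAB)

def verifiable_reward(pred_tokens, target_tokens):
    """Placeholder reward: fraction of predicted tokens matching the target."""
    return (pred_tokens == target_tokens).float().mean(dim=-1)

def rlvr_step(model, optimizer, state, target_tokens):
    logits = model(state).expand(GROUP, -1, -1)          # one group of rollouts
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample()                              # (GROUP, SEQ_LEN)
    log_prob = dist.log_prob(samples).sum(dim=-1)        # per-sample log-prob

    rewards = verifiable_reward(samples, target_tokens)  # (GROUP,)
    advantages = rewards - rewards.mean()                # group-mean baseline
    loss = -(advantages.detach() * log_prob).mean()      # REINFORCE-style loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()

# Toy usage: the model learns to put probability mass on the target tokens.
model = ToyWorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
state = torch.randn(1, VOCAB)
target = torch.randint(VOCAB, (SEQ_LEN,))
for _ in range(200):
    avg_reward = rlvr_step(model, opt, state, target)
print(f"mean verifiable reward after training: {avg_reward:.2f}")
```

The key design choice is that the learning signal comes from a reward that can be checked against ground truth, rather than from the likelihood of a single reference continuation.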
“Imagine you’re training a robot to stack blocks,” explains Dr. Mingsheng Long, a lead researcher on the project. “With MLE, the robot might learn to predict the general position of the next block, but RLVR-World allows us to reward the robot specifically for precise stacking, leading to much more effective learning.”
The team demonstrated the power of RLVR-World across various modalities, including language and video. In text-based games, RLVR-World improved state prediction accuracy by up to 30.7% compared to MLE-trained models. In web navigation tasks, it boosted the F1 score for web page state prediction by 15.1%.
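Metrics like these double as verifiable rewards. The sketch below shows the kind of scoring functions they imply; the state formats and whitespace tokenization are illustrative assumptions, not the paper's exact evaluation code.

```python
# Hedged sketch of verifiable rewards for state prediction: exact match for a
# text-game state, token-level F1 for a predicted web-page state.
from collections import Counter

def exact_match_reward(predicted_state: str, true_state: str) -> float:
    """1.0 if the predicted state description matches exactly, else 0.0."""
    return float(predicted_state.strip() == true_state.strip())

def token_f1_reward(predicted_state: str, true_state: str) -> float:
    """Token-level F1 between predicted and ground-truth page states."""
    pred, true = predicted_state.split(), true_state.split()
    overlap = sum((Counter(pred) & Counter(true)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(true)
    return 2 * precision * recall / (precision + recall)

print(exact_match_reward("door: open, key: held", "door: open, key: held"))  # 1.0
print(token_f1_reward("cart contains 2 items", "cart contains 3 items"))     # 0.75
```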
Moreover, RLVR-World significantly enhanced video world models. When training a world model to predict robot object-manipulation videos, the team saw a 9.2% relative improvement in LPIPS (a perceptual-similarity metric, where lower is better) after only a few hundred RLVR training steps. This is remarkable considering that MLE training would require hundreds of thousands of steps to achieve comparable results.
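For video, a perceptual metric such as LPIPS can itself serve as the verifiable reward. The snippet below assumes the open-source `lpips` package and simply illustrates the metric; how RLVR-World normalizes or combines this signal during training is not shown here.

```python
# Perceptual reward sketch, assuming the `lpips` package (pip install lpips).
# Lower LPIPS means higher perceptual similarity, so one natural reward is
# the negated distance.
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # pretrained AlexNet-based perceptual metric

def perceptual_reward(pred_frames: torch.Tensor, true_frames: torch.Tensor) -> torch.Tensor:
    """Per-sample reward = -LPIPS, for frames shaped (N, 3, H, W) in [-1, 1]."""
    with torch.no_grad():
        dist = lpips_fn(pred_frames, true_frames)  # (N, 1, 1, 1)
    return -dist.view(-1)

# Toy usage with random frames (real frames would come from the world model).
pred = torch.rand(2, 3, 64, 64) * 2 - 1
true = torch.rand(2, 3, 64, 64) * 2 - 1
print(perceptual_reward(pred, true))
```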
One compelling example highlighted in the paper is the reduction of repetition in video predictions. MLE-trained models often simply copy content from previous frames rather than predicting genuine changes, a failure mode analogous to the repetition and hallucination seen in language models. RLVR-World's direct optimization of prediction quality mitigates this issue, leading to more realistic and useful video predictions.
The researchers also demonstrated the utility of RLVR-World in downstream applications such as policy evaluation and model-predictive control. Using RLVR-enhanced world models, they built web agents that outperformed counterparts built on MLE-trained world models.
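Model-predictive control with a learned world model typically follows a simple loop: imagine the consequences of candidate action sequences, score the imagined trajectories, and execute the best first action. The random-shooting sketch below illustrates this pattern; the `world_model` and `reward_fn` interfaces are assumptions for illustration, not the paper's implementation.

```python
# Illustrative random-shooting MPC with a learned world model.
import torch

def mpc_plan(world_model, reward_fn, state, horizon=5, num_candidates=64, action_dim=4):
    """Sample candidate action sequences, roll them out through the world
    model, and return the first action of the highest-reward plan."""
    actions = torch.rand(num_candidates, horizon, action_dim) * 2 - 1  # candidates in [-1, 1]
    states = state.expand(num_candidates, -1)
    total_reward = torch.zeros(num_candidates)

    for t in range(horizon):
        states = world_model(states, actions[:, t])  # predicted next states
        total_reward += reward_fn(states)            # score imagined states

    best = total_reward.argmax()
    return actions[best, 0]                          # execute only the first action

# Toy usage with placeholder dynamics and reward.
toy_model = lambda s, a: s + 0.1 * a                 # dummy linear dynamics
toy_reward = lambda s: -s.pow(2).sum(dim=-1)         # drive the state toward zero
state = torch.randn(1, 4)
print(mpc_plan(toy_model, toy_reward, state))
```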
“RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly,” said Dr. Long. This research opens up new possibilities for creating AI systems that can better understand and interact with the world around them.