RLinf-VLA: A Unified and Efficient Framework for Embodied AI Training
In the rapidly evolving field of embodied artificial intelligence, researchers are increasingly turning to Vision-Language-Action (VLA) models to enable robots to understand, reason, and act in complex environments. However, training these models has proven challenging: current methods are often fragmented and lack a standardized evaluation platform. RLinf-VLA, a new unified and efficient framework, is designed to close this gap.
The core innovation of RLinf-VLA lies in its ability to seamlessly integrate multiple VLA architectures, reinforcement learning (RL) algorithms, and simulation environments. This means researchers can now use a single platform to test and compare different models, such as OpenVLA and OpenVLA-OFT, alongside various RL algorithms like PPO and GRPO, all within diverse simulators like ManiSkill and LIBERO.
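To make this component-level swapping concrete, here is a minimal configuration sketch. The class and field names below are illustrative assumptions, not RLinf-VLA's actual API; the point is that the model, RL algorithm, simulator, and GPU mode are selected independently through one unified configuration.

```python
# Illustrative sketch only: these class and field names are assumptions,
# not RLinf-VLA's actual API. The idea is that model, algorithm, simulator,
# and GPU mode are chosen independently through a single configuration.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    model: str       # e.g. "openvla" or "openvla-oft"
    algorithm: str   # e.g. "ppo" or "grpo"
    simulator: str   # e.g. "maniskill" or "libero"
    gpu_mode: str    # "colocated", "disaggregated", or "hybrid"

# Swapping any single component is a one-line change; the rest of the
# training pipeline stays the same.
config = TrainConfig(model="openvla-oft", algorithm="grpo",
                     simulator="libero", gpu_mode="hybrid")
```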
One of the key hurdles in VLA training is the efficient utilization of computational resources, particularly GPUs, which are needed both to run the simulation environment and to train the model. RLinf-VLA addresses this with a flexible GPU allocation system offering three distinct modes: “colocated” (all components share GPUs), “disaggregated” (each component gets its own dedicated GPUs), and a novel “hybrid” mode. The hybrid mode, featuring fine-grained pipelining, is particularly effective for GPU-parallelized simulators. It works by breaking rollout generation into stages that can run concurrently, significantly reducing idle time and improving training speed by 1.61x to 1.88x.
Imagine a robot learning to stack blocks. In a traditional setup, the model must wait for the simulator to finish updating the environment before computing its next action, and the simulator then sits idle while the model thinks. With RLinf-VLA’s hybrid mode, while the model is choosing actions for one batch of simulated scenes, the simulator can already be stepping another batch forward, keeping both components busy and speeding up learning.
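The self-contained sketch below illustrates this pipelining idea under stated assumptions: environments are split into two groups so that one group can simulate in the background while the policy computes actions for the other. MockEnvGroup, step_async, and wait are hypothetical stand-ins, not RLinf-VLA's actual interfaces.

```python
# Sketch of the fine-grained pipelining idea behind the "hybrid" mode.
# MockEnvGroup, step_async, and wait are hypothetical stand-ins,
# not RLinf-VLA's actual interfaces.
import threading
import numpy as np

class MockEnvGroup:
    """Stands in for one batch of a GPU-parallelized simulator."""
    def __init__(self, num_envs, obs_dim=8):
        self.num_envs, self.obs_dim = num_envs, obs_dim
        self._obs = np.zeros((num_envs, obs_dim))
        self._thread = None

    def reset(self):
        return self._obs

    def step_async(self, actions):
        # Launch the (mock) physics step in the background; the real use of
        # `actions` is omitted here. While this group simulates, the policy
        # is free to run inference for the other group.
        def _work():
            self._obs = np.random.randn(self.num_envs, self.obs_dim)
        self._thread = threading.Thread(target=_work)
        self._thread.start()

    def wait(self):
        self._thread.join()
        return self._obs

def policy(obs):
    # Placeholder for VLA-model inference.
    return np.tanh(obs)

def rollout_pipelined(group_a, group_b, num_steps):
    obs_a, obs_b = group_a.reset(), group_b.reset()
    for _ in range(num_steps):
        group_a.step_async(policy(obs_a))  # group A starts simulating ...
        group_b.step_async(policy(obs_b))  # ... while B's actions are computed
        obs_a, obs_b = group_a.wait(), group_b.wait()

rollout_pipelined(MockEnvGroup(64), MockEnvGroup(64), num_steps=10)
```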
Beyond system-level optimizations, RLinf-VLA also incorporates algorithmic enhancements: lightweight critics, loss normalization, action masking, and rollout filtering, which together make training more efficient and stable.
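As an illustration of how two of these techniques might look in code, here is a hedged sketch of a token-level policy-gradient loss with action masking and loss normalization. The tensor names and shapes are assumptions made for the example, not RLinf-VLA's exact implementation.

```python
# Hedged sketch of action masking and loss normalization applied to a
# token-level policy-gradient loss. Tensor names and shapes are assumptions
# for illustration, not RLinf-VLA's code.
import torch

def masked_pg_loss(logprobs, advantages, action_mask):
    """
    logprobs:    (batch, seq) log-probabilities of the taken action tokens
    advantages:  (batch, seq) advantage estimates
    action_mask: (batch, seq) 1.0 where a token is a real action, 0.0 for padding
    """
    per_token = -logprobs * advantages * action_mask
    # Loss normalization: divide by the number of valid action tokens rather
    # than the full sequence length, so padding does not dilute the gradient.
    return per_token.sum() / action_mask.sum().clamp(min=1)
```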
The empirical results presented in the paper are striking. A single model trained with RLinf-VLA achieved a 98.11% success rate across 130 tasks from the LIBERO benchmark and 97.66% across 25 tasks from the ManiSkill benchmark, demonstrating that the framework handles large-scale, multi-task learning effectively.
Furthermore, the study highlights the superior performance of RL-trained policies over traditional supervised fine-tuning (SFT) methods. In a real-world experiment with a Franka robot, an RL-trained policy successfully completed a pick-and-place task with previously unseen objects, while an SFT-trained policy failed. This showcases the enhanced generalization capabilities of RL-based VLA training, a critical factor for deploying AI in unpredictable real-world scenarios.
In conclusion, RLinf-VLA offers a unified, efficient, and robust platform that is poised to accelerate research and development in embodied intelligence. By standardizing the training process and providing a comprehensive suite of tools, the framework empowers researchers to explore the full potential of VLA models for creating more capable and adaptable robots. The framework has been open-sourced, aiming to foster community collaboration and drive further advancements in the field.