AI Papers Reader

Personalized digests of latest AI research


Generative Policies Break the Stability Barrier in Online Reinforcement Learning with New GoRL Framework

Reinforcement learning (RL) has long been hampered by a critical trade-off: policies that are stable enough for robust online training are often too simple to represent complex behaviors. Conversely, expressive generative models, capable of capturing intricate, multimodal action distributions, tend to collapse when optimized online due to noisy gradients and intractable likelihoods.

Researchers from Beijing University of Posts and Telecommunications, Nanyang Technological University, and A*STAR have resolved this fundamental tension by introducing Generative Online Reinforcement Learning (GoRL), a novel framework that structurally separates stable optimization from expressive action generation.

The core insight behind GoRL is a technique called latent-generative factorization. Instead of optimizing the complex generative model directly in action space (which leads to instability), GoRL breaks the policy into two parts: an Encoder and a Decoder. The Encoder is a simple, tractable Gaussian policy operating over a stable latent space. This policy governs the agent’s high-level intent and is optimized using proven, stable methods like Proximal Policy Optimization (PPO), which relies on analytical likelihoods.
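
To make the split concrete, here is a minimal, hypothetical sketch of what such a latent Gaussian encoder and its PPO-style clipped update could look like in PyTorch. The class name, network sizes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): a tractable Gaussian policy
# over a latent "intent" space, exposing the analytical log-likelihoods
# that a PPO-style clipped update requires.
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Gaussian policy pi(z | s) over a latent intent space."""
    def __init__(self, obs_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Parameter(torch.zeros(latent_dim))

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_head(self.backbone(obs))
        return torch.distributions.Normal(mean, self.log_std.exp())

def ppo_clip_loss(encoder, obs, latents, old_log_probs, advantages, clip=0.2):
    """Standard PPO clipped surrogate, applied to the latent policy."""
    new_log_probs = encoder.dist(obs).log_prob(latents).sum(-1)
    ratio = (new_log_probs - old_log_probs).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip, 1 + clip) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Because the latent policy is Gaussian, the log-probabilities and importance ratios that PPO needs are available in closed form, which is exactly what keeps this half of the system stable.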

The complexity is delegated to the Decoder, a high-capacity generative model (such as Flow Matching or Diffusion models) responsible for synthesizing specific, expressive actions from the learned latent intent.
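
Below is a hedged sketch of what a conditional flow-matching decoder of this kind might look like: a small network that learns a velocity field transporting Gaussian noise to actions, conditioned on the latent intent. The straight-line interpolation path, Euler sampler, and all names are illustrative assumptions, not the paper's exact model.

```python
# Hypothetical conditional flow-matching decoder: maps Gaussian noise to
# actions, conditioned on the latent intent z produced by the encoder.
import torch
import torch.nn as nn

class FlowDecoder(nn.Module):
    """Predicts a velocity field v(a_t, t, z) used to transport noise to actions."""
    def __init__(self, action_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + latent_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_t, t, z):
        return self.net(torch.cat([a_t, t, z], dim=-1))

def flow_matching_loss(decoder, actions, latents):
    """Supervised objective: regress the velocity field toward (action - noise)
    along a straight interpolation path between noise and target action."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    a_t = (1 - t) * noise + t * actions          # interpolate noise -> action
    target_velocity = actions - noise
    pred = decoder(a_t, t, latents)
    return ((pred - target_velocity) ** 2).mean()

@torch.no_grad()
def sample_action(decoder, z, action_dim, steps: int = 10):
    """Integrate the learned velocity field with simple Euler steps."""
    a = torch.randn(z.shape[0], action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((z.shape[0], 1), i * dt)
        a = a + dt * decoder(a, t, z)
    return a
```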

GoRL’s stability is maintained through a two-timescale alternating optimization schedule. First, the latent Encoder is updated for many steps using standard RL to discover new high-reward behavioral strategies. Crucially, this optimization is anchored by a KL-regularization term, preventing the latent policy from drifting into unstable regions. Once this return-maximization phase is complete, the Encoder is frozen. Next, the generative Decoder is refined using a supervised learning objective, mapping a fixed, stable Gaussian noise distribution to the newly discovered high-reward actions.
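
Putting the two phases together, the alternating schedule could be sketched as follows, reusing the LatentEncoder/ppo_clip_loss and FlowDecoder/flow_matching_loss sketches above. The collect_rollouts helper, step counts, and KL weight are placeholders for illustration, not values from the paper.

```python
# Hedged sketch of a single two-timescale GoRL cycle, under the assumptions
# stated above. `collect_rollouts` is a hypothetical helper that returns a
# dict with "obs", "latents", "old_log_probs", "advantages", and "actions".
import copy
import torch

def gorl_cycle(encoder, decoder, optim_enc, optim_dec, collect_rollouts,
               encoder_steps=50, decoder_steps=10, kl_weight=0.1):
    # ---- Phase 1: optimize the latent Gaussian encoder with PPO + KL anchor ----
    anchor = copy.deepcopy(encoder)          # frozen reference policy
    anchor.requires_grad_(False)
    for _ in range(encoder_steps):
        batch = collect_rollouts(encoder, decoder)
        loss = ppo_clip_loss(encoder, batch["obs"], batch["latents"],
                             batch["old_log_probs"], batch["advantages"])
        # KL regularization keeps the latent policy near the anchor
        kl = torch.distributions.kl_divergence(
            encoder.dist(batch["obs"]), anchor.dist(batch["obs"])).sum(-1).mean()
        optim_enc.zero_grad()
        (loss + kl_weight * kl).backward()
        optim_enc.step()

    # ---- Phase 2: freeze the encoder, refine the generative decoder ----
    for p in encoder.parameters():
        p.requires_grad_(False)
    for _ in range(decoder_steps):
        batch = collect_rollouts(encoder, decoder)   # newly discovered high-reward actions
        optim_dec.zero_grad()
        flow_matching_loss(decoder, batch["actions"], batch["latents"]).backward()
        optim_dec.step()
    for p in encoder.parameters():
        p.requires_grad_(True)
```

The key design choice is that only the simple Gaussian policy ever sees noisy RL gradients; the generative decoder is always trained with a plain supervised regression objective, which is what keeps the overall system stable.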

This separation creates a virtuous cycle: the stable latent policy discovers robust control intents, and the expressive generative decoder absorbs these intentions, allowing the overall system to acquire increasingly complex motor skills without sacrificing optimization stability.

The framework’s effectiveness was demonstrated across challenging continuous-control tasks from the DMControl Suite. On the highly non-linear HopperStand task, which requires keeping the one-legged hopper balanced upright on a fine edge, GoRL achieved a normalized return exceeding 870, more than three times the performance of the strongest baseline methods, including recent generative approaches like Flow Policy Optimization (FPO) and Diffusion PPO (DPPO).

This exceptional performance stems from GoRL’s ability to develop multimodal action distributions. Traditional policies are often forced to find an average, unimodal solution, which can be suboptimal (the “mode-covering” problem). For instance, in complex balancing acts, an optimal policy might involve sharply pushing left or sharply pushing right. GoRL learns this bimodal structure naturally. While standard Gaussian PPO remained limited to a single broad action peak, GoRL’s action distribution evolved two clearly separated peaks over the course of training, reflecting its mastery of distinct, high-return control strategies.

By providing the first algorithm-agnostic framework that guarantees stable latent optimization alongside highly expressive generative modeling in online RL, GoRL opens a practical pathway for deploying cutting-edge generative AI techniques in dynamic real-world robotics and control systems.