AI Papers Reader

Personalized digests of the latest AI research


Balancing the Scales: New "MARBLE" Framework Helps AI Image Generators Master Multiple Skills at Once

When we ask an AI to generate an image of a “medieval knight holding a sign that says ‘Welcome Home’ in a flower garden,” we aren’t just looking for one thing. We want the knight to look realistic (aesthetics), the garden to match the prompt (alignment), and the text to be spelled correctly (OCR).

In the world of Reinforcement Learning (RL) fine-tuning, these requirements are known as “rewards.” Until now, training AI to satisfy all of them simultaneously has been a frustrating balancing act. Most developers either train separate models for each skill or try to mash all the rewards together into a single “weighted sum” score. However, a new paper from researchers at Zhejiang University and HiThink introduces MARBLE, a framework that treats these competing goals like a diplomatic negotiation rather than a simple math problem.

The Problem of the “Specialist Sample”

The core issue with current training methods is what the researchers call the “specialist sample” phenomenon. Imagine you are training an AI using a picture of a cat. This image is great for teaching the AI about fur textures and lighting (aesthetic rewards), but it contains no text, meaning it is useless for teaching the AI how to spell (text rewards).

If you use a traditional “weighted sum” approach, the neutral signal from the spelling reward effectively “waters down” the strong signal from the aesthetic reward. The model gets confused, receiving a diluted, mediocre instruction. In fact, the researchers found that in 80% of training batches, the standard approach actually pushes the model away from at least one of its goals.
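As a toy illustration of that failure mode (the gradient values below are hand-picked for the example, not taken from the paper), we can check whether a weighted-sum update still points in each reward's direction of improvement:

```python
import numpy as np

# Hypothetical per-reward gradients for one "specialist" sample
# (a text-free cat photo): a strong aesthetic signal and a weak,
# partly conflicting OCR signal. Values are illustrative only.
g_aesthetic = np.array([1.0, 0.0])
g_ocr = np.array([-0.6, 0.1])

# Weighted-sum baseline: collapse everything into one gradient.
g_sum = 0.5 * g_aesthetic + 0.5 * g_ocr

# A negative dot product means the combined step moves the model
# *away* from that reward's direction of improvement.
print(np.dot(g_sum, g_aesthetic))  # 0.2    -> diluted but still positive
print(np.dot(g_sum, g_ocr))        # -0.115 -> OCR actively gets worse
```

This is exactly the conflict the paper reports: the summed step is mediocre for one goal and harmful for another.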

How MARBLE Negotiates Excellence

MARBLE (Multi-Aspect Reward BaLancE) solves this by moving the conflict resolution from the “score” stage to the “gradient” stage. Instead of collapsing all feedback into a single number, MARBLE calculates a specific “push” (a gradient) for every individual reward.

Think of it like a group of people trying to move a heavy sofa through a narrow door. One person wants to push left, and another wants to push forward. A “weighted sum” approach might result in a shove that hits the doorframe. MARBLE, however, uses a mathematical technique called Quadratic Programming to find a “common descent direction”—the precise angle that moves the sofa forward in a way that satisfies everyone’s requirements.
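The paper's exact quadratic program isn't reproduced here, but for two gradients the "common descent direction" idea has a well-known closed form: the min-norm point of the gradients' convex hull, as used in multiple-gradient descent (MGDA). A minimal sketch under that assumption:

```python
import numpy as np

def common_descent(g1, g2):
    """Min-norm point of the convex hull of two gradients (MGDA-style
    closed form). A sketch of the 'common descent direction' concept,
    not MARBLE's actual QP, which handles five or more rewards.
    """
    d = g2 - g1
    denom = np.dot(d, d)
    if denom == 0.0:
        return g1  # the gradients already agree
    # Optimal mixing weight, clipped to stay inside the convex hull.
    a = np.clip(np.dot(d, g2) / denom, 0.0, 1.0)
    return a * g1 + (1.0 - a) * g2

# Two "people pushing the sofa" in different directions:
g_aesthetic = np.array([1.0, 0.0])
g_ocr = np.array([-0.6, 1.0])
d = common_descent(g_aesthetic, g_ocr)
# d has a non-negative dot product with BOTH gradients, so one step
# along d improves (or at least does not hurt) every reward.
```

The min-norm solution is the "precise angle" from the sofa analogy: it is the single direction that makes simultaneous progress on every goal whenever one exists.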

To keep this process efficient, the team introduced “amortized” optimization. Solving a quadratic program over five or more rewards at every single training step would grind training to a halt. MARBLE instead calculates the best “negotiated direction” every few steps and smoothly applies it to the steps in between, allowing it to run at 97% of the speed of much simpler, less effective methods.
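The amortization schedule can be sketched as a small wrapper that reruns the expensive negotiation only every `k` steps and reuses the cached weights in between. The `negotiate` stand-in below is a placeholder for the paper's QP solve (its details aren't given here); the class name and interface are hypothetical:

```python
import numpy as np

def negotiate(grads):
    """Placeholder for MARBLE's quadratic-programming solve; here we
    just use inverse-norm weights so the sketch runs end to end."""
    norms = np.array([np.linalg.norm(g) + 1e-12 for g in grads])
    w = 1.0 / norms
    return w / w.sum()

class AmortizedBalancer:
    """Run the expensive negotiation every k steps; cheap steps reuse
    the cached weights on fresh gradients (a sketch of the idea, not
    the paper's exact schedule)."""

    def __init__(self, solver, k=10):
        self.solver, self.k = solver, k
        self.step, self.w = 0, None

    def combine(self, grads):
        if self.step % self.k == 0 or self.w is None:
            self.w = self.solver(grads)  # expensive: full solve
        self.step += 1
        # Cheap: apply cached weights to this step's gradients.
        return sum(w_i * g_i for w_i, g_i in zip(self.w, grads))

balancer = AmortizedBalancer(negotiate, k=10)
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for _ in range(30):
    update = balancer.combine(grads)
# Only 3 full solves (steps 0, 10, 20) are paid for 30 training steps.
```

Amortizing the solve this way is what lets the per-step cost stay close to that of a plain weighted sum.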

Proving the Results

Testing MARBLE on the state-of-the-art Stable Diffusion 3.5 Medium model, the researchers optimized for five distinct rewards, including aesthetic appeal and prompt faithfulness. Unlike previous methods that often sacrificed one skill to improve another, MARBLE improved all five dimensions simultaneously.

Qualitative results showed that while older methods often produced blurry text or ignored complex spatial instructions (like “a kite above a toothbrush”), MARBLE consistently generated sharp, high-contrast images that followed every part of the user’s prompt. By treating AI “quality” as the multi-dimensional challenge it truly is, MARBLE provides a scalable blueprint for the next generation of creative AI.