AI Papers Reader

Personalized digests of latest AI research


Why Your AI Is Reciting Scripts: Understanding "Template Collapse" in Reinforcement Learning

Artificial intelligence has entered its “reasoning era,” where models don’t just spit out answers but “think out loud” through complex chains of thought. However, a new study from researchers at Northwestern, Stanford, and several other institutions reveals a silent failure mode in how we train these agents. Even when a model appears to be thinking deeply, it may actually be stuck in a “template collapse”—reciting sophisticated-looking scripts that have nothing to do with the problem at hand.

The paper, titled RAGEN-2: Reasoning Collapse in Agentic RL, identifies a critical flaw in how researchers monitor Reinforcement Learning (RL). Usually, developers track “entropy”—a measure of how diverse a model’s responses are. High entropy is generally seen as a sign of a healthy, “creative” model.

But the RAGEN-2 team discovered that entropy is a deceptive metric. A model can produce long, complex reasoning chains that look diverse but are actually “input-agnostic.”
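To see why entropy can mislead, here is a toy sketch (not from the paper) of how it plays out: a collapsed model that cycles through a few stock templates, completely ignoring the prompt, still produces a healthy-looking entropy score. The template strings and prompt names are invented for illustration.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of an empirical distribution over samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A "collapsed" model: its outputs vary, but the variation is identical
# no matter which prompt it sees (input-agnostic templates).
templates = [
    "First, I will examine the variables...",
    "Next, I will apply the relevant theorem...",
    "Finally, I will verify the result...",
]
responses_to_math = [templates[i % 3] for i in range(9)]
responses_to_geometry = [templates[i % 3] for i in range(9)]

# Both prompts yield the same, healthy-looking entropy (log2(3) ~ 1.585 bits),
# even though the model never engaged with either question.
print(entropy(responses_to_math), entropy(responses_to_geometry))
```

Entropy only measures how spread out the outputs are; it says nothing about whether that spread is driven by the input, which is exactly the blind spot the paper identifies.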

The Student Analogy

To build an intuition for this, imagine a student taking a math test. A “collapsed” student might start every answer with: “First, I will carefully examine the numerical variables, then I will apply the relevant theorem to synthesize a solution, and finally, I will verify the result for consistency.”

To a teacher looking at any single answer, this sounds like high-quality reasoning, and across the whole test the polished phrasing can even register as diverse (high entropy). But if the student writes that exact same paragraph for a geometry proof, a calculus problem, and a simple addition task, they aren’t actually thinking. They’ve just learned a “template” that sounds smart enough to occasionally stumble onto the right answer.

The researchers call this “template collapse.” Because the reasoning looks fluent, it bypasses standard filters, silently eroding the AI’s ability to solve real-world tasks.

The Signal-to-Noise Problem

Why does this happen? The researchers explain it through the lens of the Signal-to-Noise Ratio (SNR). During training, the “signal” comes from the difference in rewards: the model learns because it sees that one way of thinking led to success while another led to failure.

However, RL training also involves “noise”—mathematical constraints (like KL divergence) used to keep the model stable. If a particular problem is too easy or too hard, the model doesn’t see much difference in rewards (low signal). In these cases, the “noise” of the training process takes over. The model essentially gives up on the specific details of the prompt and drifts toward a generic, safe template that satisfies the training algorithm’s hunger for stability without actually doing the work.

A New Diagnostic and Cure

To fix this, the team introduced two major innovations. First, they moved beyond entropy to a metric called “Mutual Information” (MI). MI measures how much the reasoning chain actually depends on the input. If you can’t look at the AI’s “thoughts” and guess what the original question was, the model has collapsed.
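The paper's actual MI estimator over reasoning chains isn't reproduced here, but the idea can be sketched with a simple plug-in estimate over (input, reasoning) pairs. The prompts and reasoning labels below are invented stand-ins: a collapsed model emits the same template for every input (MI of zero), while a healthy model's reasoning tracks the input (positive MI).

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(input; reasoning) in bits from observed pairs."""
    n = len(pairs)
    joint = Counter(pairs)                  # counts of (input, reasoning)
    px = Counter(x for x, _ in pairs)       # marginal counts of inputs
    py = Counter(y for _, y in pairs)       # marginal counts of reasoning
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) expressed with raw counts: c * n / (px * py)
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Collapsed: identical reasoning template regardless of input.
collapsed = [("geometry", "apply the template"),
             ("calculus", "apply the template"),
             ("addition", "apply the template")] * 3

# Healthy: reasoning depends on the input.
healthy = [("geometry", "draw the figure"),
           ("calculus", "differentiate"),
           ("addition", "add the digits")] * 3

print(mutual_information(collapsed))  # prints 0.0
print(mutual_information(healthy))    # ~ 1.585 bits (log2 of 3 distinct inputs)
```

This captures the paper's intuition directly: if you can reconstruct the question from the reasoning, MI is high; if the reasoning is a fixed script, MI drops to zero no matter how elaborate the script looks.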

Second, they developed “SNR-Aware Filtering.” Instead of training on every prompt, the system evaluates the “reward variance” of each task. If the model is getting roughly the same reward regardless of what it tries, that prompt is discarded as “low signal.” Training is then concentrated only on “high-signal” prompts where the model can clearly distinguish right from wrong.
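The paper's exact filtering criterion isn't specified here, but a minimal sketch of the idea, assuming a per-prompt variance threshold on rollout rewards, looks like this. The function name, threshold value, and prompt labels are illustrative, not from the paper.

```python
from statistics import pvariance

def filter_high_signal(prompt_rewards, min_variance=0.05):
    """Keep only prompts whose sampled rollout rewards actually vary.

    prompt_rewards: dict mapping each prompt to a list of rewards
    collected from several rollouts on that prompt.
    """
    return [prompt for prompt, rewards in prompt_rewards.items()
            if pvariance(rewards) >= min_variance]

rollouts = {
    "too_easy":  [1.0, 1.0, 1.0, 1.0],  # always solved: zero reward variance
    "too_hard":  [0.0, 0.0, 0.0, 0.0],  # never solved: zero reward variance
    "learnable": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: high variance
}

print(filter_high_signal(rollouts))  # prints ['learnable']
```

Prompts where every rollout gets the same reward carry no gradient signal, so discarding them leaves the optimizer with only the prompts where success and failure can actually be told apart.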

Testing across planning puzzles (Sokoban), math, and web navigation, the researchers found that this filtering method consistently boosted performance and stopped template collapse in its tracks. By teaching AI to ignore the “boring” problems where it wasn’t learning, they forced it to actually think about the hard ones.