AI Papers Reader

Personalized digests of latest AI research

Reinforcement Learning Doesn't Just Teach LLMs to Reason—It Teaches Them How to Navigate Their Own Memories

New research challenges the widely held view that Reinforcement Learning (RL) forces Large Language Models (LLMs) to trade factual recall for improved reasoning. Instead, the authors find that RL-enhanced models become significantly better at retrieving structured, hierarchical knowledge, not by memorizing new facts but by learning to navigate the knowledge they already hold.

The finding runs counter to the prevailing wisdom that post-training techniques like Reinforcement Learning from Human Feedback (RLHF), often used to make models safer and better aligned, necessarily incur an “alignment tax” in which models forget specific facts.

Researchers from Simon Fraser University, FAIR at Meta, and ETH Zurich found that specialized reasoning models consistently outperform their base and supervised counterparts on tasks requiring the traversal of hierarchical structures, such as looking up medical codes or patent classifications.

The Code-Lookup Test

To illustrate, consider the challenge of identifying an ICD-9-CM medical code, such as 57.95. A standard, non-reasoning LLM (like DeepSeek-V3) often attempts direct factual recall, which frequently fails because the code space is so large.

However, an RL-enhanced model (like DeepSeek-R1) employs a systematic, multi-step process: first identifying the broad chapter (e.g., Chapter 11, codes 57.0-57.99), then narrowing down the relevant procedure set, and finally pinpointing the specific meaning (“Replacement of indwelling urinary catheter”).

The authors argue that this improved performance is rooted in strategy, not memory. When they guided the base model with a structured prompt that forced it to follow the same hierarchical steps, the accuracy gap between the base model and the RL-enhanced model on the MedConceptsQA benchmark shrank from 24 percentage points to 7. This suggests that much of the knowledge was latent in the base model but inaccessible without procedural guidance.
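
To make the idea of a structured prompt concrete, here is a minimal sketch of how such a prompt might be assembled. It is not the authors' actual prompt: the function name build_hierarchical_prompt, the level names, and the wording of each step are all assumptions chosen for illustration.

```python
def build_hierarchical_prompt(code: str, levels: list[str]) -> str:
    """Build a prompt that asks a model to walk a hierarchy before answering.

    `levels` lists the hierarchy levels from broadest to narrowest.
    The step wording is hypothetical, not taken from the paper.
    """
    steps = [
        f"Step {i}: Identify the {level} that contains code {code}."
        for i, level in enumerate(levels, start=1)
    ]
    final_step = f"Step {len(levels) + 1}: State what code {code} refers to."
    return "\n".join([f"Question: What is code {code}?", *steps, final_step])


# Hypothetical level names for a medical coding hierarchy.
LEVELS = ["chapter", "procedure group"]

prompt = build_hierarchical_prompt("57.95", LEVELS)
print(prompt)
```

The point of the sketch is only that the prompt spells out the broad-to-narrow order of the lookup, so the model is pushed to traverse the hierarchy instead of guessing the answer in one step.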

To check that RL improved navigation skills rather than just final-answer accuracy, the team tested models on “Memory-Heavy” tasks, which required five or more hierarchical “hops” (traversals) to find the Nearest Common Ancestor of two patent codes. On these deepest retrieval tasks, RL models achieved a higher Path Matching Score than base models, indicating that they traced the hierarchy correctly step by step rather than merely recalling the final answer.
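
The task itself can be sketched on a toy hierarchy. The example below finds the nearest common ancestor of two nodes, counts the hops between them, and scores a predicted path against a reference path. The node labels are synthetic placeholders, not real patent codes, and path_match_score is an assumed position-by-position definition; the paper's exact Path Matching Score may be computed differently.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def nearest_common_ancestor(a, b, parent):
    """Return (ancestor, hops): the deepest shared ancestor of a and b,
    and the number of edges on the path a -> ancestor -> b."""
    path_a = path_to_root(a, parent)
    seen = set(path_a)
    for hops_b, node in enumerate(path_to_root(b, parent)):
        if node in seen:
            return node, path_a.index(node) + hops_b
    raise ValueError("nodes do not share a root")


def path_match_score(predicted, reference):
    """Assumed metric: fraction of reference steps that the predicted
    path reproduces at the same position."""
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Synthetic hierarchy (child -> parent); labels are placeholders.
PARENT = {"X1a": "X1", "X1": "X", "X2b": "X2", "X2": "X", "X": "ROOT"}

ancestor, hops = nearest_common_ancestor("X1a", "X2b", PARENT)
reference_path = path_to_root("X1a", PARENT)[:3]   # ["X1a", "X1", "X"]
wrong_turn = ["X1a", "X2", "X"]                    # right answer, wrong path
print(ancestor, hops)
print(path_match_score(reference_path, reference_path))
print(path_match_score(wrong_turn, reference_path))
```

The second scored path ends at the right ancestor but takes a wrong intermediate step, which is the distinction a path-level metric is meant to capture and a final-answer metric would miss.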

The most compelling evidence came from layer-wise internal activation analysis. The researchers compared how LLMs process declarative statements (like “Code 57.95 refers to…”) versus interrogative queries (like “What is code 57.95?”).

They found that representations of the declarative statements remained highly similar between the base and RL models, with cosine similarity of 0.85-0.92 across layers. Crucially, representations of the questions diverged, with similarity dropping to 0.65-0.73 in the middle layers.
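
The comparison behind these numbers can be sketched as a per-layer cosine similarity between the two models' activations for the same input. The vectors below are toy values, not real activations, and the function names are assumptions. How the researchers extracted and pooled hidden states is not described here, so the sketch simply takes one vector per layer as given.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def layerwise_similarity(base_layers, rl_layers):
    """Compare two models layer by layer on the same input.

    Each argument is a list with one activation vector per layer.
    """
    if len(base_layers) != len(rl_layers):
        raise ValueError("both models must have the same number of layers")
    return [cosine_similarity(b, r) for b, r in zip(base_layers, rl_layers)]


# Toy 2-dimensional "activations" for a 3-layer model (illustrative only).
base_query = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
rl_query = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

similarities = layerwise_similarity(base_query, rl_query)
print([round(s, 3) for s in similarities])
```

In this toy case the first and last layers match exactly and only the middle layer diverges, mirroring the shape of the dip reported for query representations.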

This asymmetry suggests that RL training changes how models process questions and traverse their internal knowledge, acting as a kind of cognitive scaffolding, while leaving the underlying factual representations largely intact.

The findings point to a promising direction for LLM development: training paradigms that separate knowledge acquisition (pretraining) from the organization and navigation of that knowledge (post-training).