
LLMs Master Analogies by Abstracting Relations, But Struggle to Apply Them

In a detailed internal analysis of large language models (LLMs), researchers have uncovered the precise mechanisms by which these AI systems perform analogical reasoning, revealing a sophisticated, human-like ability to abstract relationships—but also a critical bottleneck in applying those abstractions to new contexts.

The study, conducted by researchers at Korea University and AIGEN Sciences, investigated the inner workings of models like Qwen and Llama-2 using two types of analogy tasks: proportional analogies (A is to B as C is to D) and story analogies (finding structural parallels between different narratives).

The findings confirm that successful analogical reasoning hinges on the LLM’s ability to encode high-level relational concepts, a process that occurs primarily in the “mid-upper layers” of the neural network.

The Relational vs. Attributive Divide

For proportional analogies—such as identifying the missing element in “Persuasion is to Jane Austen as 1984 is to ____”—the LLM must first extract the relationship (“author of”) and then apply it to the third entity (1984).
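
To see the task concretely, a proportional analogy can be posed to an off-the-shelf causal language model as a simple completion prompt. The sketch below is illustrative only; the model name, decoding settings, and prompt wording are placeholders, not the paper's exact setup.

```python
# Illustrative only: the model name and decoding settings are placeholders,
# not necessarily those used in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Persuasion is to Jane Austen as 1984 is to"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding of a few tokens; a model that solves the analogy
# should continue with "George Orwell".
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```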

Using techniques like attention knockout and representation probing (Patchscopes), the team tracked two types of information within the model’s layers: attributive information (the concrete facts about an entity, e.g., Jane Austen is a novelist) and relational information (the connection between entities).
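
Attention knockout, in its simplest form, blocks the flow of information from one token position to another by forcing that attention edge to zero and observing how the prediction changes. The self-contained toy function below shows the core operation on a single attention-score matrix; in the actual study the intervention is applied inside specific heads and layers of the model, and Patchscopes-style probing instead decodes what a hidden representation contains.

```python
import torch
import torch.nn.functional as F

def knockout_attention(scores: torch.Tensor, query_pos: int, key_pos: int) -> torch.Tensor:
    """Toy attention knockout: block one query->key edge before the softmax.

    `scores` is a (seq_len, seq_len) matrix of raw attention scores
    (rows = queries, columns = keys). In a real model this masking would be
    applied inside chosen heads and layers via hooks; this is only a sketch.
    """
    blocked = scores.clone()
    blocked[query_pos, key_pos] = float("-inf")  # edge carries zero weight after softmax
    return F.softmax(blocked, dim=-1)

# Example: a 4-token sequence with random scores. Knock out the attention paid
# by the final token (where the answer is predicted) to token 1 (hypothetically
# a relation-bearing token), then compare the attention distributions.
scores = torch.randn(4, 4)
weights_full = F.softmax(scores, dim=-1)
weights_knocked = knockout_attention(scores, query_pos=3, key_pos=1)
print(weights_full[3])
print(weights_knocked[3])  # weight on key 1 is exactly 0
```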

They discovered that attributive information remains robustly encoded regardless of whether the model is correct or incorrect. However, relational information shows a sharp decline in failure cases. This suggests LLMs often fail not because they don’t know the facts, but because they lose the thread of the abstract connection.

“LLMs effectively encode the underlying relationships between analogous entities,” the authors note. “But applying the relation often remains as much a bottleneck as encoding it.”

Strategic Patching Reveals Application Failures

Crucially, the researchers demonstrated that LLMs often struggle with transferring the learned relation. In incorrect cases, they intervened by strategically “patching” the model’s internal representations. By injecting the correct relational concept from a successful analogy—for instance, replacing the internal representation of the first pair in a failed task with the abstract “author of” relation extracted from a correct pair—they were able to rectify up to 38.4% of the initial errors.
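
A generic version of this kind of intervention is activation patching with forward hooks: cache the hidden state at a chosen layer and token position from a run the model gets right, then overwrite the corresponding activation in a failing run. The sketch below illustrates that recipe; the model name, layer index, token position, and prompts are placeholders, and the paper's Patchscopes-based procedure may differ in detail.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: any decoder-only model exposing `.model.layers` works the same way.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer, pos = 16, -1  # hypothetical "mid-upper" layer and token position to patch

# 1) Cache the activation from an analogy the model completes correctly.
good = tok("Persuasion is to Jane Austen as Hamlet is to", return_tensors="pt")
with torch.no_grad():
    out = model(**good, output_hidden_states=True)
cached = out.hidden_states[layer][:, pos, :].clone()  # hidden_states[0] = embeddings

# 2) Re-run a failing analogy, overwriting that activation with a forward hook.
def patch_hook(module, inputs, output):
    hidden = output[0]             # decoder layers return a tuple; [0] is the hidden states
    hidden[:, pos, :] = cached     # inject the cached representation
    return (hidden,) + output[1:]

handle = model.model.layers[layer - 1].register_forward_hook(patch_hook)
bad = tok("Persuasion is to Jane Austen as 1984 is to", return_tensors="pt")
with torch.no_grad():
    logits = model(**bad).logits[0, -1]  # next-token distribution under the patched run
handle.remove()

print(tok.decode([logits.argmax().item()]))  # check whether the prediction shifts
```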

This intervention confirmed that in many failures, the LLM had successfully extracted the relation from the first pair but failed to propagate that abstract structure through the model’s “link” position and apply it to the target entity.

Alignment is Key to Structural Understanding

To assess deeper cognitive alignment, the team examined story analogies, which demand finding parallels between semantically distinct situations. The model must recognize, for example, that a story about a warrior determined to conquer the battlefield is structurally analogous to one about a patient determined to conquer their disease.

The analysis showed that successful reasoning is marked by a strong “Mutual Alignment Score” (MAS)—a measure of token-level alignment between the source and target stories. In correct cases, the LLM successfully aligned corresponding relational elements (e.g., mapping the token for ‘warrior’ to ‘patient’ and ‘battlefield’ to ‘disease’), even with minimal shared vocabulary.
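
The paper defines MAS precisely; as a rough proxy, a token-level alignment score can be computed by matching each token's representation in one story to its most similar counterpart in the other and averaging over both directions. The function below is a simplified stand-in for that idea, not the authors' exact formula.

```python
import torch
import torch.nn.functional as F

def mutual_alignment_score(src_reps: torch.Tensor, tgt_reps: torch.Tensor) -> float:
    """Simplified token-level alignment score between two stories.

    `src_reps` and `tgt_reps` are (num_tokens, hidden_dim) matrices of token
    representations (e.g., hidden states from a mid-upper layer). Each token is
    matched to its most similar counterpart in the other story by cosine
    similarity, and the two directions are averaged. This is a stand-in for the
    paper's MAS, not its exact definition.
    """
    src = F.normalize(src_reps, dim=-1)
    tgt = F.normalize(tgt_reps, dim=-1)
    sim = src @ tgt.T                          # (src_tokens, tgt_tokens) cosine similarities
    src_to_tgt = sim.max(dim=1).values.mean()  # best match for each source token
    tgt_to_src = sim.max(dim=0).values.mean()  # best match for each target token
    return 0.5 * (src_to_tgt + tgt_to_src).item()

# Toy usage with random vectors standing in for token hidden states.
warrior_story = torch.randn(12, 768)
patient_story = torch.randn(10, 768)
print(mutual_alignment_score(warrior_story, patient_story))
```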

Conversely, when the model failed, its internal alignment score was higher between the source story and a distractor story, indicating that it was leaning on surface-level lexical similarity rather than true structural parallels.

Overall, the investigation reveals that LLMs exhibit an emergent, human-like capability for abstraction, but that their limitations often stem from the fragile process of transferring and applying that abstract relational structure to novel targets. The work highlights the parallels between human and artificial cognition while paving the way for future improvements in the robustness of LLM reasoning.