AI Papers Reader

Personalized digests of latest AI research

View on GitHub

Rethinking AI's Math Skills: Do They Translate to Other Tasks?

Large language models (LLMs) have made remarkable strides in mathematical reasoning, often surpassing human performance on complex benchmarks. However, a new study questions whether these impressive math skills translate to broader problem-solving abilities or are merely a result of specialized training, potentially leading to “catastrophic forgetting” of general knowledge.

The research, published on arXiv, delves into the transferability of reasoning capabilities in LLMs across various domains, including math, scientific question answering, agent planning, coding, and standard instruction-following. The findings reveal a surprising disconnect: many models that excel at math struggle to apply their enhanced reasoning to other tasks.

The Key Differentiator: Training Methods

A significant factor influencing this transferability, the study found, is the fine-tuning paradigm used to train these models. Specifically, models fine-tuned using Reinforcement Learning (RL) demonstrated a much stronger ability to generalize their reasoning skills to non-math tasks. In contrast, models trained with Supervised Fine-Tuning (SFT), a more common approach, often showed a decline in their general capabilities after math-specific training.

To illustrate, imagine an AI trained intensely on solving complex algebra problems. While it might become exceptionally good at equations, it could then falter when asked to summarize a news article or answer a general knowledge question. The study suggests this is precisely what happens with SFT-tuned models. They seem to “forget” their broader understanding of language and the world as they specialize in math.

Unpacking the “Why”: Latent Space and Token Shifts

The researchers employed sophisticated techniques to understand why this divergence occurs. By analyzing the models’ internal representations (how they process information), they discovered that SFT tends to cause significant shifts in these representations, particularly for non-reasoning tasks. This is akin to reorganizing a library so drastically to house a new collection of math books that it becomes difficult to find existing general fiction.

RL, on the other hand, appears to preserve the original structure of the model’s knowledge. It adjusts the representations in a more targeted way, maintaining a better balance between specialized reasoning and general language understanding. This is like adding a new section for math books in the library, carefully integrating it without disrupting the existing organization.

Concrete Examples of Transferability

The study measured this transferability using a “Transferability Index.” For instance, one RL-trained model showed a significant positive transferability to “other reasoning” tasks (like scientific QA) and even “non-reasoning” tasks (like conversational AI), indicating its math improvements were beneficial across the board. Conversely, an SFT-trained model might show a negative transferability to non-reasoning tasks, meaning its math training actively harmed its performance in those areas.

A controlled experiment using the Qwen3-14B model further solidified these findings. When fine-tuned on math problems using RL, the model not only improved its math scores but also maintained or even enhanced its performance on general tasks. However, the SFT-tuned version of the same model, despite excelling at math, saw its performance degrade on non-math tasks.

Implications for Future AI Development

These findings have significant implications for how we train and evaluate LLMs. The research suggests that relying solely on SFT with math-distilled data might not be the optimal strategy for developing generally capable AI systems. Instead, RL-based approaches, which seem to foster a more robust and transferable form of reasoning, warrant further investigation. The study’s authors advocate for a re-evaluation of standard post-training methods, emphasizing the need for approaches that balance specialized skills with broad, general-purpose intelligence.