Why Can't Transformers Learn Multiplication? New Research Reveals Long-Range Dependency Pitfalls
Despite their impressive capabilities in language understanding and generation, large language models (LLMs) notoriously struggle with seemingly simple arithmetic such as multi-digit multiplication. A new study digs into this puzzle, reverse-engineering a model that does learn multiplication to pinpoint why standard LLMs falter and how the limitation can be overcome.
The research, posted on arXiv, contrasts a model trained with standard fine-tuning (SFT), which fails at multiplication, with a model trained using an “implicit chain-of-thought” (ICoT) approach. ICoT initially guides the model with explicit step-by-step reasoning tokens, then gradually phases them out so the model internalizes the reasoning process.
Key Findings from the ICoT Model:
The study reveals three core insights from the ICoT model’s success:
- Evidence of Long-Range Structure: The ICoT model demonstrably encodes the long-range dependencies multiplication requires. The researchers confirmed this with “logit attributions” (examining which input tokens influence an output prediction) and “linear probes” (testing whether specific information can be read out of the model’s internal states; see the probe sketch after this list). To compute a given product digit, the model must combine multiple digits from both operands, often separated by many tokens.
- Mechanism of Long-Range Dependency: The ICoT model constructs a sophisticated “attention tree.” This mechanism lets the model strategically select pairs of digits for computing partial products and then “cache” and “retrieve” these intermediate results. Think of it as a smart notepad on which the model writes down the partial calculations it needs, in a structured way, so they are available for later steps.
- Geometric Representation: Interestingly, the ICoT model employs an efficient and intuitive geometric approach: partial products are formed through “Minkowski sums” of digit representations, and the digits themselves are represented in a “Fourier basis” (see the Fourier sketch after this list). In other words, the model develops a structured internal representation of numbers that supports the complex dependencies multiplication requires.
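To make the probing methodology concrete, here is a minimal linear-probe sketch. The “hidden states” below are synthetic stand-ins that, by construction, linearly encode a partial-product digit; in the paper’s setting the activations come from the trained ICoT model, and the question is whether that information is present at all. The target digit, feature dimensions, and noise level are all illustrative, not taken from the paper.

```python
# Minimal linear-probe sketch on synthetic "hidden states".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_model = 5000, 64

a = rng.integers(10, 100, size=n)          # first operand (two digits)
b = rng.integers(10, 100, size=n)          # second operand (two digits)
target = (a % 10) * (b % 10) % 10          # units digit of one partial product

# Stand-in activations: a random projection of the target (one-hot),
# plus distractor features and noise.
onehot = np.eye(10)[target]
distractors = rng.normal(size=(n, 16))
hidden = (onehot @ rng.normal(size=(10, d_model))
          + distractors @ rng.normal(size=(16, d_model))
          + 0.5 * rng.normal(size=(n, d_model)))

X_tr, X_te, y_tr, y_te = train_test_split(hidden, target, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")   # high by construction

# Control: a probe trained on shuffled labels should fall to chance (~0.1),
# the usual check that the probe reads real structure rather than memorising.
control = LogisticRegression(max_iter=2000).fit(X_tr, rng.permutation(y_tr))
print(f"shuffled-label control: {control.score(X_te, y_te):.2f}")
```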
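And to get a feel for the Fourier-basis claim, the toy snippet below encodes a digit d as complex phases exp(2πi·k·d/10) at a few frequencies k. Under such an encoding, adding digits modulo 10 corresponds to multiplying (rotating) phases, the algebraic property that makes sums of digit representations easy for linear operations to manipulate. The frequencies here are illustrative, not the ones recovered from the trained model.

```python
# Toy illustration of a Fourier encoding of digits: addition mod 10
# becomes phase rotation, a structure linear layers can exploit.
import numpy as np

ks = np.array([1, 2, 5])                       # illustrative frequencies

def fourier_rep(d: int) -> np.ndarray:
    """Complex Fourier features exp(2*pi*i*k*d/10) of a single digit."""
    return np.exp(2j * np.pi * ks * d / 10)

d1, d2 = 7, 8
lhs = fourier_rep((d1 + d2) % 10)              # encoding of the digit sum
rhs = fourier_rep(d1) * fourier_rep(d2)        # elementwise phase product
print(np.allclose(lhs, rhs))                   # True: digit addition <-> rotation
```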
The Pitfall of Standard Fine-Tuning:
The researchers found that standard fine-tuning, using gradient descent and an auto-regressive loss, drives the SFT model into a “local optimum” that lacks the crucial long-range dependencies. As a result, the model learns the first and last digits of the product relatively easily, since each depends mainly on a handful of input digits, but it gets stuck on the middle digits, which require summing many digit-pair products and propagating carries across positions.
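A quick count makes the asymmetry concrete. In ordinary long multiplication of two n-digit numbers (plain schoolbook arithmetic, not the model’s internal algorithm), the outermost output columns receive only one digit-pair product each, while the middle columns receive up to n of them, plus carries that can depend on every lower column:

```python
# Count how many digit-pair products feed each output column (before carries)
# when multiplying two n-digit numbers. Middle columns collect the most terms.
def terms_per_column(n: int) -> list[int]:
    counts = [0] * (2 * n - 1)
    for i in range(n):          # digit position in the first operand
        for j in range(n):      # digit position in the second operand
            counts[i + j] += 1  # the pair (i, j) contributes to column i + j
    return counts

for n in (2, 4, 8):
    print(n, terms_per_column(n))
# 2 [1, 2, 1]
# 4 [1, 2, 3, 4, 3, 2, 1]
# 8 [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]
```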
A Solution: Inductive Bias:
To address this limitation, the researchers introduced an auxiliary loss: an additional objective that encourages the model to predict a “running partial sum” through a simple linear regression on its hidden states. This “inductive bias” (a built-in preference for learning certain kinds of patterns) let the model learn the required long-range dependencies and reach near-perfect accuracy on multiplication, even without the explicit chain-of-thought guidance used during ICoT training.
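As a hedged reconstruction (not the paper’s code), the combined objective can be sketched like this: a small linear head regresses the running partial sum from each position’s hidden state, and its loss is added to the usual next-token loss. The layer choice, the 0.1 weighting, and the exact definition of the running-sum targets are assumptions here.

```python
# Sketch of an autoregressive loss plus an auxiliary "running partial sum"
# regression. `hidden`, `targets`, and `running_sum` are placeholder tensors
# standing in for real model activations and labels.
import torch
import torch.nn as nn

d_model, vocab, seq_len, batch = 128, 16, 12, 4

lm_head = nn.Linear(d_model, vocab)        # standard next-token head
aux_head = nn.Linear(d_model, 1)           # linear readout of the running sum

hidden = torch.randn(batch, seq_len, d_model, requires_grad=True)
targets = torch.randint(vocab, (batch, seq_len))
running_sum = torch.randn(batch, seq_len)  # e.g. normalised partial sums

lm_loss = nn.functional.cross_entropy(
    lm_head(hidden).reshape(-1, vocab), targets.reshape(-1))
aux_loss = nn.functional.mse_loss(
    aux_head(hidden).squeeze(-1), running_sum)

loss = lm_loss + 0.1 * aux_loss            # the weighting is an assumption
loss.backward()
print(f"lm_loss={lm_loss.item():.3f}  aux_loss={aux_loss.item():.3f}")
```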
In essence, this study not only uncovers why current Transformer architectures struggle with tasks requiring long-range dependencies but also offers a concrete path toward building models that can master such challenges. The findings suggest that future advancements may lie in designing training methodologies and model architectures that inherently encourage the learning of these critical long-range relationships.