paper:arxiv-2510-00184
Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
Original abstract
Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to "cache" and "retrieve" pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
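To make the "running sum" and the long-range dependency concrete, here is a minimal Python sketch of schoolbook multiplication organized by output position. It rests on an assumption: the abstract does not define the running sum, so the code takes it to be the carry from the previous position plus all pairwise partial products a_i * b_j with i + j = k. The function names `digits_lsb_first`, `running_sums`, and `product_from_running_sums` are hypothetical and are not taken from the paper.

```python
def digits_lsb_first(n: int) -> list[int]:
    """Return the decimal digits of a non-negative int, least-significant first."""
    return [int(ch) for ch in reversed(str(n))]


def running_sums(a: int, b: int) -> list[int]:
    """Per-position running sums for a * b (assumed definition, see lead-in).

    sums[k] = carry from position k - 1
              + sum of a_i * b_j over all digit pairs with i + j == k.
    """
    a_digits = digits_lsb_first(a)
    b_digits = digits_lsb_first(b)
    n_positions = len(a_digits) + len(b_digits)  # upper bound on product length

    sums = []
    carry = 0
    for k in range(n_positions):
        # Every pair (i, k - i) contributes, so output digit k can depend on
        # input digits that sit far apart in the token sequence.
        partial = sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        total = partial + carry
        sums.append(total)
        carry = total // 10
    return sums


def product_from_running_sums(sums: list[int]) -> int:
    """Rebuild the product: output digit k is sums[k] mod 10."""
    return sum((s % 10) * 10**k for k, s in enumerate(sums))


sums = running_sums(12, 34)
print(sums)                             # [8, 10, 4, 0]
print(product_from_running_sums(sums))  # 408
```

Under this reading, each output digit depends on every digit pair whose indices sum to its position, plus a carry that chains through all lower positions. This is the kind of dependency the abstract says standard fine-tuning fails to learn.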
Related work: refs + corpus + external arXiv
The "cited", "in corpus", and "arXiv" badges show which signals surfaced each row. Rows surfaced by multiple sources are weighted higher.
- Can Transformers Learn to Solve Problems Recursively? Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer (2023), ≈ 78%
- Learning Modular Exponentiation with Transformers. David Demitri Africa, Sara M. Kapoor, Theo Simon Sorg, Challenger Mishra (2025), ≈ 76%
- How Transformers Get Rich: Approximation and Dynamics Analysis. Mingze Wang, Ruoxi Yu, Weinan E, Lei Wu (2025), ≈ 75%
- How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator. Subhash Kantamneni, Ziming Liu, and Max Tegmark (2024), ≈ 74%
- How Transformers Learn Causal Structure with Gradient Descent. Eshaan Nichani, Alex Damian, Jason D. Lee (2024), ≈ 74%
- Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities. Mayank Jobanputra, Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn (2025), ≈ 74%
- Birth of a Transformer: A Memory Viewpoint. Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou (2023), ≈ 74%
- Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning. Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis (2026), ≈ 74%
- Relating transformers to models and neural representations of the hippocampal formation [in corpus] (2021), ≈ 73%
- Learning without neurons in physical systems [in corpus] (2022), ≈ 71%
- A Mathematical Framework for Transformer Circuits [in corpus] (2021), ≈ 70%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? [in corpus] (2025), ≈ 69%
- Simulators (LessWrong) [in corpus], ≈ 68%
- Model Alignment Search [in corpus] (2025), ≈ 67%
- The Platonic Representation Hypothesis [in corpus] (2024), ≈ 67%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [in corpus] (2023), ≈ 66%
Cited by (1)
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as