Why can't transformers learn multiplication? Reverse-engineering reveals long-range dependency pitfalls

By Xiaoyan Bai · Itamar Pres · Yuntian Deng · Chenhao Tan · Stuart M. Shieber · Fernanda B. Viégas +2 more

DOI 10.48550/arxiv.2510.00184 arXiv 2510.00184

Abstract

Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to "cache" and "retrieve" pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
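To make the long-range dependency concrete, the sketch below computes a product digit by digit. It assumes one plausible reading of the "running sum" named in the abstract: at output position k, the sum of all pairwise partial products a_i * b_j with i + j = k, plus the carry from position k - 1. The function names are hypothetical and the code illustrates only the arithmetic, not the paper's model or probe.

```python
def running_sums(a: int, b: int) -> list[int]:
    """Running sum at each output position, least-significant digit first.

    Assumed definition (not taken from the paper's text): the sum of
    partial products a_i * b_j with i + j == k, plus the incoming carry.
    """
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant first
    b_digits = [int(d) for d in str(b)][::-1]
    sums = []
    carry = 0
    for k in range(len(a_digits) + len(b_digits)):
        # Every digit pair whose indices add up to k contributes here,
        # so one output digit can depend on digits far apart in the input.
        partial = sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        total = partial + carry
        sums.append(total)
        carry = total // 10
    return sums


def multiply_via_running_sums(a: int, b: int) -> int:
    """Recover the product: each output digit is its running sum mod 10."""
    digits = [s % 10 for s in running_sums(a, b)]
    return sum(d * 10**k for k, d in enumerate(digits))


print(running_sums(12, 34))               # [8, 10, 4, 0]
print(multiply_via_running_sums(12, 34))  # 408
```

Each running sum determines both the output digit (its value mod 10) and the carry passed to the next position, which is why a quantity of this kind is a natural regression target for an auxiliary loss.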

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Transformers converge to invariant algorithmic cores
Joshua S. Schiffman
2026
≈ 79%
Understanding Addition and Subtraction in Transformers
Clement Neo, Fazl Barez, Philip Quirke
2025
≈ 78%
On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
Zichao Wei
2026
≈ 78%
Algorithmic Capabilities of Random Transformers
Jacob Andreas, Ziqian Zhong
2024
≈ 78%
Can Transformers Learn to Solve Problems Recursively?
Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer, Shizhuo Dylan Zhang
2023
≈ 78%
Learning Modular Exponentiation with Transformers
Sara M. Kapoor, Theo Simon Sorg, Challenger Mishra, David Demitri Africa
2025
≈ 76%
Mechanistic Interpretability of Binary and Ternary Transformers
Jason Li
2024
≈ 76%
How Transformers Get Rich: Approximation and Dynamics Analysis
Ruoxi Yu, Weinan E, Lei Wu, Mingze Wang
2025
≈ 75%
An Introduction to Transformers
Richard E. Turner
2026
≈ 75%
How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator
Ziming Liu, Max Tegmark, Subhash Kantamneni
2024
≈ 74%
How Transformers Learn Causal Structure with Gradient Descent
Alex Damian, Jason D. Lee, Eshaan Nichani
2024
≈ 74%
Learning Transformer Programs
Alexander Wettig, Danqi Chen, Dan Friedman
2023
≈ 74%
Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities
Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn, Mayank Jobanputra
2025
≈ 74%
Birth of a Transformer: A Memory Viewpoint
Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou, Alberto Bietti
2023
≈ 74%
Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning
Willem Zuidema, Claire E. Stevenson, Martha Lewis, Philipp Hellwig
2026
≈ 74%
Relating transformers to models and neural representations of the hippocampal formation
in corpus
2021
≈ 73%
Learning without neurons in physical systems
in corpus
2022
≈ 71%
A Mathematical Framework for Transformer Circuits
in corpus
2021
≈ 70%
Living Things Are Not (20th Century) Machines: Updating Mechanism Metaphors in Light of the Modern Science of Machine Behavior
in corpus
2021
≈ 69%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 69%
Simulators — LessWrong
in corpus
≈ 68%
Janus Information Flow Transformers 2025
in corpus
≈ 68%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 68%
Model Alignment Search
in corpus
2025
≈ 67%
The Platonic Representation Hypothesis
in corpus
2024
≈ 67%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 66%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 66%

Cited by (1)

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as