A Mathematical Framework for Transformer Circuits

By Nelson Elhage · Neel Nanda · Catherine Olsson · Chris Olah · Tom Henighan · Dario Amodei (Anthropic, OpenAI)

TL;DR

Induction heads — attention heads that search for prior occurrences of the current token and predict the following token — constitute the primary in-context learning mechanism in two-layer attention-only transformers, and emerge exclusively through K-composition between a first-layer previous-token head and second-layer heads; they do not appear in one-layer models. The paper introduces the path expansion trick as its core analytical instrument: by representing transformer computation as a sum over end-to-end paths rather than a product over layers, it renders the weights of zero-, one-, and two-layer attention-only models directly interpretable. Zero-layer transformers reduce to bigram log-likelihood tables accessible via W_U W_E; one-layer models decompose into bigram plus skip-trigram ("A…BC") ensembles readable from the ~2.5-billion-entry expanded OV and QK matrices (for a ~50,000-token vocabulary); two-layer models introduce three composition types (Q-, K-, V-composition) of which K-composition is empirically dominant in small models, enabling induction heads verified through eigenvalue analysis of W_OV and W_QK and confirmed on out-of-distribution random repeated token sequences. Models studied use configurations including 12 heads with d_head=64 and 32 heads with d_head=128, context size 2048 tokens. The paper argues that induction heads represent a qualitative algorithmic transition point — from statistical look-up to sequence-completion inference — that continues to be relevant in larger realistic language models, providing a replicable foothold for mechanistic interpretability of transformers at scale.
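
For reference, the one-layer case of the path expansion described above can be written out explicitly. This follows the paper's tensor-product notation, with A^h the attention pattern of head h, W_{OV}^h = W_O^h W_V^h, W_{QK}^h = W_Q^{hT} W_K^h, t the one-hot token matrix, and softmax* the autoregressively masked softmax. The index set H_1 for the layer's heads is a labeling choice made here.

```latex
T \;=\; \underbrace{\mathrm{Id} \otimes W_U W_E}_{\text{direct path (bigram)}}
\;+\; \sum_{h \in H_1} \underbrace{A^h \otimes \left( W_U W_{OV}^h W_E \right)}_{\text{per-head skip-trigram term}},
\qquad
A^h \;=\; \operatorname{softmax}^{*}\!\left( t^{\top} W_E^{\top} W_{QK}^h W_E \, t \right)
```

With the attention pattern A^h held fixed, each term is linear in the input tokens, which is what lets the bigram and skip-trigram tables be read off the weights.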

What to take away

1. Induction heads, which implement the algorithm '[a][b]…[a] → [b]' by K-composing with a previous-token head, are the dominant in-context learning mechanism in two-layer attention-only transformers and are absent from one-layer models.
2. One-layer attention-only transformers are mathematically equivalent to an ensemble of a bigram model and skip-trigram ('A…BC') models, and the full skip-trigram table can be read directly from the weights without running the model by expanding the OV circuit W_U W_{OV}^h W_E and QK circuit W_E^T W_{QK}^h W_E.
3. In a 12-head, d_head=64 model, eigenvalue analysis of the expanded OV matrices shows that 10 out of 12 attention heads exhibit significantly positive eigenvalue skew, consistent with copying behavior confirmed by qualitative inspection.
4. The paper introduces the path expansion trick: writing the transformer as a product over layers and expanding it into a sum over end-to-end paths, each corresponding to a linear (given frozen attention) or bilinear function of tokens, making every term individually interpretable.
5. Composition between attention heads takes three forms — Q-composition, K-composition, and V-composition — where Q- and K-composition enrich the attention pattern and V-composition creates 'virtual attention heads' expressible as (A^{h2}A^{h1}) ⊗ (W_{OV}^{h2}W_{OV}^{h1}); ablation experiments on a small two-layer model show virtual head (V-composition) terms contribute negligibly to loss reduction.
6. Induction heads tested on randomly sampled, out-of-distribution repeated token sequences (tokens drawn uniformly from a ~50,000-token vocabulary, repeated three times) continue to attend correctly to prior occurrences, indicating the mechanism is abstract rather than distributional.
7. A Frobenius-norm composition metric (for example, ||W_{QK}^{h2} W_{OV}^{h1}||_F / (||W_{QK}^{h2}||_F ||W_{OV}^{h1}||_F) for K-composition, minus the random-matrix baseline) shows that in the analyzed two-layer model only a single layer-0 head participates in significant K-composition, with all other inter-head composition near zero.
8. The paper introduces 'virtual weights' W_I^2 W_O^1 as the implicit direct connection between any two non-adjacent layers via the residual stream, noting that at layer 25 of a 50-layer transformer, the residual stream has roughly 100× more neurons than dimensions on each side of it, so those neurons must presumably communicate through it in superposition.
9. One-layer models produce systematic skip-trigram 'bugs': because the OV and QK circuits factorize three-token interactions as f1(a,b)·f2(a,c), boosting 'keep…in→mind' and 'keep…at→bay' necessarily also boosts 'keep…in→bay' and 'keep…at→mind', a failure mode the authors identify as directly readable from expanded weights.
10. An open question the paper raises is whether the eigenvalue positivity summary statistic is the right formalization of 'copying matrix,' given that non-orthogonal eigenvectors allow matrices with all positive eigenvalues to still decrease some tokens' self-logits, and whether alternative notions (diagonal analysis, top-k self-promotion rate) would be more robust or reveal different structure in larger models.
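
The skip-trigram 'bug' in item 9 can be illustrated with a toy sketch. It is not the paper's code: the five-word vocabulary, the 0/1 scores, and the helper names `teach` and `skip_trigram_score` are all hypothetical, and the softmax over attention scores is ignored so that the factorized form f1(a,b)·f2(a,c) stays visible.

```python
import numpy as np

vocab = ["keep", "in", "at", "mind", "bay"]
idx = {tok: i for i, tok in enumerate(vocab)}
n = len(vocab)

# f1: toy QK scores, qk[dst, src] = how strongly `dst` attends back to `src`.
qk = np.zeros((n, n))
# f2: toy OV effects, ov[out, src] = logit boost for `out` when `src` is attended.
ov = np.zeros((n, n))

def teach(src, dst, out, strength=1.0):
    """Boost the skip-trigram src ... dst -> out by raising both factors."""
    qk[idx[dst], idx[src]] = strength
    ov[idx[out], idx[src]] = strength

def skip_trigram_score(src, dst, out):
    """Factorized score f1(src, dst) * f2(src, out); there is no joint (dst, out) term."""
    return qk[idx[dst], idx[src]] * ov[idx[out], idx[src]]

teach("keep", "in", "mind")
teach("keep", "at", "bay")

intended = [skip_trigram_score("keep", "in", "mind"),
            skip_trigram_score("keep", "at", "bay")]
bugs = [skip_trigram_score("keep", "in", "bay"),
        skip_trigram_score("keep", "at", "mind")]
print("intended:", intended, "unintended:", bugs)
```

Because the destination token only enters through f1 and the output token only through f2, the two unintended combinations receive the same boost as the intended ones in this toy setup.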

Peer brief — for seminar discussion

Elhage et al. (2021) develop a mathematical framework for mechanistic interpretability of transformers by studying decoder-only, attention-only models with zero, one, and two layers, using configurations including 12 heads at d_head=64 and 32 heads at d_head=128 with a context length of 2048 tokens trained on the dataset described in Kaplan et al. The core method is the path expansion trick: rather than analyzing the residual stream directly, the transformer's computation is written as a product over layers and algebraically expanded into a sum of end-to-end path terms, each a linear (or bilinear) function of input tokens when attention patterns are frozen. This converts opaque layer-wise computations into interpretable circuits: zero-layer models reduce to a bigram table W_U W_E; one-layer models become ensembles of that bigram term plus per-head skip-trigram terms reading from the ~50,000×50,000 expanded OV circuit W_U W_{OV}^h W_E and QK circuit W_E^T W_{QK}^h W_E; two-layer models introduce Q-, K-, and V-composition between heads. The load-bearing finding is that K-composition between a first-layer previous-token head and second-layer heads produces induction heads — circuits implementing '[a][b]…[a] → [b]' — as the dominant in-context learning mechanism in two-layer models, a qualitative algorithmic leap over the one-layer copying heads that implement only '[b]…[a] → [b]'. Eigenvalue analysis of W_OV confirms 10 out of 12 heads in a one-layer model are copying-dominated; V-composition ablations show virtual attention head terms contribute negligible marginal loss in the two-layer model; and induction heads generalize to randomly sampled out-of-distribution repeated token sequences, confirming the mechanism is abstract. 
The framework predicts that induction heads and K-composition with a previous-token head should recur as building blocks in larger models and drive in-context learning at scale — a hypothesis explicitly deferred to a follow-up paper on larger realistic language models. An alternative analytical approach the paper could have used is gradient-based attribution (e.g., integrated gradients), but it explicitly contrasts with that direction, arguing that attention weight analysis without the full circuit decomposition systematically misattributes importance, as softmax saturation suppresses gradients at the very tokens the induction head most confidently uses. One thing a critical reader should push back on is the treatment of layer normalization: throughout the theoretical development, layer norm is folded into adjacent weights or treated as a scalar rescaling, but this approximation breaks comparability across paths through different layer norms and is not validated quantitatively — it is unclear how much the clean circuit algebra degrades for standard trained models where layer norm scaling is non-trivial. Additionally, the entire framework is developed on attention-only models, which lack the MLP layers comprising roughly two-thirds of parameters in standard transformers like GPT-3 with its 96 layers; the authors acknowledge this but the scope limitation means the claimed relevance to large models rests almost entirely on the forthcoming follow-up rather than evidence presented here.
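
As a point of reference for discussion, the input-output behavior '[a][b]…[a] → [b]' can be sketched as a plain lookup procedure. This is only the algorithm an induction head is described as approximating, not the attention mechanism itself; the function name `induction_predict` and the test sequence (20 distinct random token ids repeated three times) are assumptions made for illustration.

```python
import random

def induction_predict(tokens):
    """For each position, find the most recent earlier occurrence of the
    current token and predict the token that followed it (None if unseen)."""
    predictions = []
    last_seen = {}  # token -> index of its most recent occurrence
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        # j < i always holds, so tokens[j + 1] is already known at position i.
        predictions.append(tokens[j + 1] if j is not None else None)
        last_seen[tok] = i
    return predictions

# Random repeated sequence: distinct token ids sampled from a 50,000-token
# vocabulary, repeated three times (sizes chosen for illustration).
rng = random.Random(0)
block = rng.sample(range(50_000), 20)
sequence = block * 3
predictions = induction_predict(sequence)

# Score next-token predictions from the start of the second repeat onward.
scored = range(len(block), len(sequence) - 1)
accuracy = sum(predictions[i] == sequence[i + 1] for i in scored) / len(scored)
print(f"accuracy after first repeat: {accuracy:.2f}")
```

On the first pass through the block no prediction is possible, since nothing has repeated yet; the procedure only becomes useful once a token recurs, which mirrors why the random repeated-sequence test isolates in-context behavior from learned statistics.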

Methods (5)

Frobenius Norm Composition Measurement
Measures Q-, K-, and V-composition between attention heads by computing the Frobenius norm of the product of the relevant matrices, divided by the product of the individual matrices' Frobenius norms.
Logit Lens
Unsupervised interpretability technique that projects intermediate activations through the unembedding matrix.
Path Expansion Method
The core analytical technique of expanding transformer computations from layer-by-layer products into sums of end-to-end path terms for independent analysis
Term Importance Analysis via Ablation
An algorithm that determines the marginal effect of n-th order path terms by running the model multiple times with frozen attention patterns and progressively replacing activations
Value-Weighted Attention Pattern Visualization
Visualizing attention patterns weighted by the norm of value vectors to better show how much information is moved from each position
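
The Frobenius norm composition measurement listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy sizes, the random weights, the helper `random_low_rank`, and the Monte Carlo estimate of the random-matrix baseline are all choices made here. The K-composition form ||W_QK^{h2} W_OV^{h1}||_F follows the expression quoted in the takeaways.

```python
import numpy as np

def composition_score(w_a, w_b):
    """Frobenius norm of the product, normalized by the individual norms."""
    num = np.linalg.norm(w_a @ w_b, ord="fro")
    return num / (np.linalg.norm(w_a, ord="fro") * np.linalg.norm(w_b, ord="fro"))

rng = np.random.default_rng(0)
d_model, d_head = 64, 8  # toy sizes, not the paper's configuration

def random_low_rank():
    """Random d_model x d_model matrix of rank d_head (stand-in for W_OV or W_QK)."""
    return rng.normal(size=(d_model, d_head)) @ rng.normal(size=(d_head, d_model))

# Layer-0 head: W_OV = W_O W_V writes into the column space of W_O.
w_o1 = rng.normal(size=(d_model, d_head))
w_ov1 = w_o1 @ rng.normal(size=(d_head, d_model))

# Layer-1 head whose keys read exactly the subspace the layer-0 head writes to.
basis, _ = np.linalg.qr(w_o1)  # (d_model, d_head) orthonormal basis of col(W_O)
w_qk2_aligned = rng.normal(size=(d_model, d_head)) @ basis.T  # W_Q^T W_K
w_qk2_unrelated = random_low_rank()

# Baseline: average score of unrelated random matrices of the same shape and rank.
baseline = np.mean([composition_score(random_low_rank(), random_low_rank())
                    for _ in range(50)])

k_aligned = composition_score(w_qk2_aligned, w_ov1) - baseline
k_unrelated = composition_score(w_qk2_unrelated, w_ov1) - baseline
print(f"aligned: {k_aligned:.3f}  unrelated: {k_unrelated:.3f}")
```

The normalization keeps the raw score between 0 and 1 regardless of weight scale, and subtracting the baseline makes unrelated head pairs land near zero so that genuinely composing pairs stand out.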

Frameworks (2)

A Mathematical Framework for Transformer Circuits
The framework this paper introduces, enabling circuit-level analysis of attention-only transformers
Distill Circuits Thread
Prior mechanistic interpretability work reverse-engineering vision models (InceptionV1); the direct predecessor this paper extends to language models

Findings (9)

Some attention heads partially specialize in copying for words that split into two tokens without a space prefix, attending from fragmented token to complete token
Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads
Induction heads in two-layer models successfully perform in-context learning on completely random repeated token sequences far outside training distribution
Strong test of the induction head hypothesis using uniformly sampled random tokens repeated three times
All induction heads in the two-layer model occupy an extreme corner of high positive QK and OV eigenvalue positivity space relative to non-induction heads
Quantitative verification of the mechanistic theory; both circuits required for the induction algorithm show the predicted copying/matching structure
One-layer model attention heads encode Python-specific skip-trigrams including indentation-based elif/else prediction and function signature patterns
Concrete example from examining expanded QK/OV matrices showing how specific programming language structure is encoded in attention weights
PCA analysis shows token embeddings and unembeddings are concentrated in a relatively small fraction of residual stream dimensions in large models
Supporting evidence for the claim that most residual stream dimensions are free for other layers to use
In the analyzed two-layer attention-only model, only K-composition is significant; V- and Q-composition are negligible by Frobenius norm measure
Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path
Result from term importance analysis breaking down loss contribution by layer
Second-order virtual attention head terms contribute negligible marginal loss reduction in the analyzed two-layer attention-only model
Result of term importance analysis ablation experiment; justifies focusing on individual head terms
10 out of 12 attention heads in the 12-head one-layer model show significantly positive eigenvalue sums, indicating copying behavior
Quantitative result from eigenvalue analysis of expanded OV matrices; confirmed by qualitative inspection
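
The eigenvalue summary statistic behind the copying results can be sketched as follows, using the form Σλ_i / Σ|λ_i| over the eigenvalues of the expanded OV matrix W_U W_OV W_E. The toy dimensions, the tied embeddings, and the positive semidefinite construction of the 'copying' head are assumptions chosen so the outcome is predictable; they are not the paper's trained weights.

```python
import numpy as np

def eigenvalue_positivity(m):
    """Sum of eigenvalues divided by sum of their magnitudes, in [-1, 1]."""
    eig = np.linalg.eigvals(m)
    return float(np.sum(eig).real / np.sum(np.abs(eig)))

rng = np.random.default_rng(0)
n_vocab, d_model, d_head = 40, 16, 4  # toy sizes

w_e = rng.normal(size=(d_model, n_vocab))  # embedding
w_u = w_e.T                                # tied unembedding (assumption)

# A head built to copy: W_OV = P P^T is positive semidefinite.
p = rng.normal(size=(d_model, d_head))
w_o, w_v = p, p.T

# Full expanded OV circuit (n_vocab x n_vocab) ...
big = w_u @ w_o @ w_v @ w_e
# ... and the d_head x d_head matrix with the same nonzero eigenvalues,
# obtained by cyclically reordering the product.
small = w_v @ w_e @ w_u @ w_o

copy_score_big = eigenvalue_positivity(big)
copy_score_small = eigenvalue_positivity(small)

# A head with unrelated random weights, for contrast.
random_small = (rng.normal(size=(d_head, d_model)) @ w_e
                @ rng.normal(size=(n_vocab, d_model))
                @ rng.normal(size=(d_model, d_head)))
random_score = eigenvalue_positivity(random_small)
print(copy_score_big, copy_score_small, random_score)
```

Working with the small reordered product avoids an eigendecomposition of a vocabulary-sized matrix, which matters when the vocabulary has about 50,000 tokens.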

Claims (20)

Induction heads work by using K-composition with a previous token head to shift keys by one token, then matching the current destination token against shifted keys to predict what follows
The mechanistic explanation of how induction heads are implemented in two-layer models
Large models form many induction heads built from K-composition with a previous token head, making induction heads a central driver of in-context learning at all scales
Forward-looking claim connecting toy model findings to large-scale language models
Attention is a generalization of convolution; all convolutions can be expressed as tensor products of fixed relative position attention patterns and weight matrices
Mathematical equivalence showing the relationship between attention mechanisms and convolutional operations
Attention heads can be understood as independent operations each adding their output to the residual stream, equivalent to the concatenate-and-multiply formulation
Mathematical equivalence enabling independent analysis of each attention head
Each attention head has two largely independent computations: a QK circuit computing the attention pattern and an OV circuit computing the effect if attended to
Key decomposition enabling separate analysis of where attention goes and what it does
In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads
Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
MLP layers are much harder to get traction on than attention layers; understanding them requires individually interpretable neurons which are rarely found
Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
All induction heads fall in an extreme corner of high OV eigenvalue positivity and high QK eigenvalue positivity, confirming the mechanistic theory
Quantitative verification that the copying and matching structure predicted by the mechanistic theory is present in all observed induction heads
Key, query, and value vectors are intermediary byproducts; W_OV and W_QK are the fundamental low-rank matrices describing attention head behavior
Reframing observation: the canonical K/Q/V decomposition is computationally convenient but not the most interpretable representation
Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
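
The claim that heads act as independent additive operations can be checked numerically. The sketch below compares the concatenate-and-multiply formulation with a sum of per-head outputs, using a row-vector convention (positions × features); all sizes, the random weights, and the helper `causal_attention_pattern` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 16, 4, 4  # toy sizes

x = rng.normal(size=(seq, d_model))  # residual stream input

def causal_attention_pattern():
    """Random row-stochastic attention pattern with a causal (lower-triangular) mask."""
    scores = rng.normal(size=(seq, seq))
    mask = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

patterns = [causal_attention_pattern() for _ in range(n_heads)]
w_v = rng.normal(size=(n_heads, d_model, d_head))  # per-head value weights
w_o = rng.normal(size=(n_heads * d_head, d_model))  # shared output projection

# Per-head result vectors, each of shape (seq, d_head).
results = [patterns[h] @ x @ w_v[h] for h in range(n_heads)]

# Formulation 1: concatenate head results, then apply one output matrix.
concat_out = np.concatenate(results, axis=-1) @ w_o

# Formulation 2: split W_O into per-head blocks and sum each head's contribution.
w_o_blocks = w_o.reshape(n_heads, d_head, d_model)
summed_out = sum(results[h] @ w_o_blocks[h] for h in range(n_heads))

print(np.allclose(concat_out, summed_out))
```

The two outputs agree because multiplying a concatenation by a matrix is the same as multiplying each piece by the corresponding block of rows and adding, which is what allows each head's W_O^h W_V^h to be analyzed on its own.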

Hypotheses (4)

Virtual attention heads (V-composition) may be much more important in larger and more complex transformers than in two-layer toy models
Forward-looking speculation based on the theoretical elegance and combinatorial growth of virtual head count with depth
The mathematical framework and induction head concept will remain at least partially relevant for larger, more realistic models
Central motivating hypothesis for the forthcoming paper on in-context learning and induction heads
The Primer architecture's depthwise convolution change would allow induction heads to form without requiring K-composition
Architectural interpretation of how Primer's design change relates to the paper's mechanistic theory of induction heads
GPT-2 implements at least one induction head using pointer arithmetic on positional embeddings rather than K-composition
Observation of an alternative induction head implementation algorithm in larger models with positional embeddings in the residual stream

Questions (5)

When and how can MLP neurons in transformers be individually interpreted, and what progress is needed to extend mechanistic interpretability to them?
Major open problem identified in the paper; MLP layers constitute 2/3 of transformer parameters
How much performance (in points of loss) do skip-trigram bugs cost the model, and do they persist in larger models?
Open question raised by the paper's identification of skip-trigram bugs as interpretability-visible failure modes
What is the correct formal definition of a 'copying matrix' that captures all and only the cases we care about?
Open methodological question about summarizing OV matrix behavior; eigenvalues are used as a working but imperfect proxy
What matrix decomposition or dimensionality reduction best summarizes the enormous low-rank OV and QK matrices?
Open methodological question about converting the 50k x 50k expanded matrices into human-graspable summaries
Do we 'fully understand' one-layer attention-only transformers?
The paper explicitly asks and addresses this question, concluding the answer depends on what 'fully understand' means

Original abstract

This paper presents a mathematical framework for reverse-engineering transformer circuits by studying small attention-only models with at most two layers. We discover that transformers can be conceptualized as sums of interpretable end-to-end computational paths, and identify "induction heads" as a key mechanism for in-context learning that emerges in two-layer models. By analyzing zero, one, and two-layer attention-only transformers, we show they implement bigram statistics, skip-trigram models, and complex compositional algorithms respectively, with findings that provide foundational insights applicable to larger language models.
