paper
active
2021
paper:2021-mathematical

A Mathematical Framework for Transformer Circuits

TL;DR

Induction heads — attention heads that search for prior occurrences of the current token and predict the following token — constitute the primary in-context learning mechanism in two-layer attention-only transformers, and emerge exclusively through K-composition between a first-layer previous-token head and second-layer heads; they do not appear in one-layer models. The paper introduces the path expansion trick as its core analytical instrument: by representing transformer computation as a sum over end-to-end paths rather than a product over layers, it renders the weights of zero-, one-, and two-layer attention-only models directly interpretable. Zero-layer transformers reduce to bigram log-likelihood tables accessible via W_U W_E; one-layer models decompose into bigram plus skip-trigram ("A…BC") ensembles readable from the ~2.5-billion-entry expanded OV and QK matrices (for a ~50,000-token vocabulary); two-layer models introduce three composition types (Q-, K-, V-composition) of which K-composition is empirically dominant in small models, enabling induction heads verified through eigenvalue analysis of W_OV and W_QK and confirmed on out-of-distribution random repeated token sequences. Models studied use configurations including 12 heads with d_head=64 and 32 heads with d_head=128, context size 2048 tokens. The paper argues that induction heads represent a qualitative algorithmic transition point — from statistical look-up to sequence-completion inference — that continues to be relevant in larger realistic language models, providing a replicable foothold for mechanistic interpretability of transformers at scale.
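The path expansion for the zero- and one-layer cases summarized above can be written out as a short worked equation. This is a sketch in the summary's own symbols; the labels T_0 and T_1, the head set H, the one-hot token matrix t, and softmax* (an autoregressively masked softmax) are notational assumptions added here for readability.

```latex
\begin{align*}
% Zero-layer model: only the direct path exists, giving bigram log-likelihoods.
T_0 &= \mathrm{Id} \otimes W_U W_E \\
% One-layer attention-only model: direct path plus one term per attention head.
T_1 &= \mathrm{Id} \otimes W_U W_E
      \;+\; \sum_{h \in H} A^{h} \otimes \left( W_U W_{OV}^{h} W_E \right) \\
% Attention pattern of head h, computed from the expanded QK circuit.
A^{h} &= \mathrm{softmax}^{*}\!\left( t^{T} \, W_E^{T} W_{QK}^{h} W_E \, t \right)
\end{align*}
```

In each product A ⊗ W, the left factor mixes information across token positions and the right factor acts on each position's vector independently. With the attention pattern A^h held fixed, every term is linear in the input tokens, which is what lets each one be read as a bigram or skip-trigram table.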

What to take away

  1. Induction heads, which implement the algorithm '[a][b]…[a] → [b]' by K-composing with a previous-token head, are the dominant in-context learning mechanism in two-layer attention-only transformers and are absent from one-layer models.
  2. One-layer attention-only transformers are mathematically equivalent to an ensemble of a bigram model and skip-trigram ('A…BC') models, and the full skip-trigram table can be read directly from the weights without running the model by expanding the OV circuit W_U W_{OV}^h W_E and QK circuit W_E^T W_{QK}^h W_E.
  3. In a 12-head, d_head=64 model, eigenvalue analysis of the expanded OV matrices shows that 10 out of 12 attention heads exhibit significantly positive eigenvalue skew, consistent with copying behavior confirmed by qualitative inspection.
  4. The paper introduces the path expansion trick: writing the transformer as a product over layers and expanding it into a sum over end-to-end paths, each corresponding to a linear (given frozen attention) or bilinear function of tokens, which makes every term individually interpretable.
  5. Composition between attention heads takes three forms (Q-composition, K-composition, and V-composition): Q- and K-composition enrich the attention pattern, while V-composition creates 'virtual attention heads' expressible as (A^{h2}A^{h1}) ⊗ (W_{OV}^{h2}W_{OV}^{h1}). Ablation experiments on a small two-layer model show that virtual head (V-composition) terms contribute negligibly to loss reduction.
  6. On randomly sampled, out-of-distribution repeated token sequences (tokens drawn uniformly from a ~50,000-token vocabulary, repeated three times), induction heads continue to attend correctly to prior occurrences, confirming the mechanism is abstract rather than distributional.
  7. A Frobenius-norm composition metric, e.g., ||W_{QK}^{h2} W_{OV}^{h1}||_F / (||W_{QK}^{h2}||_F ||W_{OV}^{h1}||_F) minus the random-matrix baseline, shows that in the analyzed two-layer model only a single layer-0 head participates in significant K-composition, with all other inter-head composition near zero.
  8. The paper introduces 'virtual weights' W_I^2 W_O^1 as the implicit direct connection between any two non-adjacent layers via the residual stream, noting that at layer 25 of a 50-layer transformer the residual stream faces roughly 100× more neurons than it has dimensions on each side, so those neurons must communicate through it in superposition.
  9. One-layer models produce systematic skip-trigram 'bugs': because the OV and QK circuits factorize three-token interactions as f1(a,b)·f2(a,c), boosting 'keep…in→mind' and 'keep…at→bay' necessarily also boosts 'keep…in→bay' and 'keep…at→mind', a failure mode the authors identify as directly readable from expanded weights.
  10. An open question the paper raises is whether the eigenvalue positivity summary statistic is the right formalization of 'copying matrix,' given that non-orthogonal eigenvectors allow matrices with all positive eigenvalues to still decrease some tokens' self-logits, and whether alternative notions (diagonal analysis, top-k self-promotion rate) would be more robust or reveal different structure in larger models.
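The eigenvalue summary statistic discussed in items 3 and 10 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `copying_score`, the toy shapes, and the random weights are assumptions, and the statistic shown is the ratio of the sum of eigenvalues to the sum of their magnitudes. The sketch relies on the fact that the nonzero eigenvalues of the vocab-by-vocab matrix W_U W_OV W_E equal those of the much smaller d_model-by-d_model matrix W_OV W_E W_U.

```python
import numpy as np

def copying_score(W_U, W_OV, W_E):
    """Eigenvalue positivity of the expanded OV circuit W_U @ W_OV @ W_E.

    Assumed shapes: W_E (d_model, vocab), W_OV (d_model, d_model),
    W_U (vocab, d_model). Returns a value in [-1, 1]; values near +1
    mean the eigenvalues are mostly positive (copying-like).
    """
    # Same nonzero eigenvalues as the (vocab, vocab) product, but cheap.
    small = W_OV @ W_E @ W_U
    eig = np.linalg.eigvals(small)
    # Undefined if every eigenvalue is zero; not handled in this sketch.
    return float(eig.sum().real / np.abs(eig).sum())

# Toy example (hypothetical weights): tied embeddings plus an identity OV
# map give a positive semidefinite circuit, so the score is close to +1.
rng = np.random.default_rng(0)
W_E = rng.normal(size=(4, 10))
W_U = W_E.T
print(copying_score(W_U, np.eye(4), W_E))
print(copying_score(W_U, -np.eye(4), W_E))  # sign-flipped: close to -1
```

As item 10 notes, a score near +1 is only a summary: with non-orthogonal eigenvectors a matrix can have all-positive eigenvalues and still lower some tokens' self-logits, so the score is best read alongside direct inspection of the expanded matrix.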

Peer brief — for seminar discussion

Elhage et al. (2021) develop a mathematical framework for mechanistic interpretability of transformers by studying decoder-only, attention-only models with zero, one, and two layers, using configurations including 12 heads at d_head=64 and 32 heads at d_head=128 with a context length of 2048 tokens trained on the dataset described in Kaplan et al. The core method is the path expansion trick: rather than analyzing the residual stream directly, the transformer's computation is written as a product over layers and algebraically expanded into a sum of end-to-end path terms, each a linear (or bilinear) function of input tokens when attention patterns are frozen. This converts opaque layer-wise computations into interpretable circuits: zero-layer models reduce to a bigram table W_U W_E; one-layer models become ensembles of that bigram term plus per-head skip-trigram terms reading from the ~50,000×50,000 expanded OV circuit W_U W_{OV}^h W_E and QK circuit W_E^T W_{QK}^h W_E; two-layer models introduce Q-, K-, and V-composition between heads. The load-bearing finding is that K-composition between a first-layer previous-token head and second-layer heads produces induction heads — circuits implementing '[a][b]…[a] → [b]' — as the dominant in-context learning mechanism in two-layer models, a qualitative algorithmic leap over the one-layer copying heads that implement only '[b]…[a] → [b]'. Eigenvalue analysis of W_OV confirms 10 out of 12 heads in a one-layer model are copying-dominated; V-composition ablations show virtual attention head terms contribute negligible marginal loss in the two-layer model; and induction heads generalize to randomly sampled out-of-distribution repeated token sequences, confirming the mechanism is abstract. 
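The '[a][b]…[a] → [b]' rule and the repeated-random-token check described above can be illustrated at the token level. The sketch below is not a transformer and does not model attention; it only spells out the lookup rule an induction head is described as implementing. The function name `induction_predict` is hypothetical, and the block samples tokens without replacement (a simplification of uniform sampling) so that every token has a unique prior occurrence.

```python
import random

def induction_predict(tokens):
    """For each position, predict the next token with [a][b]...[a] -> [b].

    Looks up the most recent earlier occurrence of the current token and
    returns the token that followed it, or None if there is no earlier
    occurrence.
    """
    last_seen = {}  # token -> index of its most recent earlier occurrence
    preds = []
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        preds.append(tokens[j + 1] if j is not None else None)
        last_seen[tok] = i
    return preds

# Random block of 20 distinct token ids from a 50,000-token vocabulary,
# repeated three times (mirrors the setup described in the brief).
random.seed(0)
block = random.sample(range(50_000), 20)
seq = block * 3
preds = induction_predict(seq)

# No predictions are possible during the first copy of the block.
first_copy_empty = all(p is None for p in preds[:20])
# From the second copy onward, every next token is predicted correctly.
later_correct = all(preds[i] == seq[i + 1] for i in range(20, len(seq) - 1))
print(first_copy_empty, later_correct)
```

Because the tokens are random, no bigram or skip-trigram statistic learned from natural text can help here; only a rule that looks back at the sequence itself succeeds, which is the sense in which the brief calls the mechanism abstract.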
The framework predicts that induction heads and K-composition with a previous-token head should recur as building blocks in larger models and drive in-context learning at scale, a hypothesis explicitly deferred to a follow-up paper on larger realistic language models. The paper could instead have used gradient-based attribution (e.g., integrated gradients); it explicitly contrasts its approach with that direction, arguing that analyzing attention weights without the full circuit decomposition systematically misattributes importance, because softmax saturation suppresses gradients at the very tokens the induction head uses most confidently. A critical reader should push back on the treatment of layer normalization: throughout the theoretical development, layer norm is folded into adjacent weights or treated as a scalar rescaling, but this approximation breaks comparability across paths through different layer norms and is not validated quantitatively, so it is unclear how much the clean circuit algebra degrades for standard trained models where layer norm scaling is non-trivial. Additionally, the entire framework is developed on attention-only models, which lack the MLP layers comprising roughly two-thirds of parameters in standard transformers such as the 96-layer GPT-3; the authors acknowledge this, but the scope limitation means the claimed relevance to large models rests almost entirely on the forthcoming follow-up rather than on evidence presented here.

Methods (5)

  • Frobenius Norm Composition Measurement
    Measuring Q-, K-, V-composition between attention heads by computing the Frobenius norm of the product of relevant matrices divided by norms of individual matrices
  • Logit Lens
    An unsupervised interpretability technique that projects activations through the unembedding matrix; provides a comparison point for the NLA approach.
  • Path Expansion Method
    The core analytical technique of expanding transformer computations from layer-by-layer products into sums of end-to-end path terms for independent analysis
  • Term Importance Analysis via Ablation
    An algorithm that determines the marginal effect of n-th order path terms by running the model multiple times with frozen attention patterns and progressively replacing activations
  • Value-Weighted Attention Pattern Visualization
    Visualizing attention patterns weighted by the norm of value vectors to better show how much information is moved from each position
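The Frobenius norm composition measurement listed above can be sketched as follows for the K-composition case. This is an illustrative sketch under stated assumptions: the function name `k_composition_score`, the per-head weight shapes, and the toy weights are hypothetical, W_QK is taken as W_Q^T W_K and W_OV as W_O W_V, and the random-matrix baseline subtraction mentioned in the takeaways is omitted.

```python
import numpy as np

def k_composition_score(W_Q2, W_K2, W_O1, W_V1):
    """||W_QK^{h2} W_OV^{h1}||_F / (||W_QK^{h2}||_F * ||W_OV^{h1}||_F).

    Assumed shapes: W_Q2, W_K2, W_V1 are (d_head, d_model);
    W_O1 is (d_model, d_head). The result lies in [0, 1] because the
    Frobenius norm is submultiplicative.
    """
    W_QK = W_Q2.T @ W_K2  # (d_model, d_model); keys are read on the right
    W_OV = W_O1 @ W_V1    # (d_model, d_model); head 1 writes on the left
    num = np.linalg.norm(W_QK @ W_OV)  # Frobenius norm by default for 2-D
    den = np.linalg.norm(W_QK) * np.linalg.norm(W_OV)
    return float(num / den)

# Toy example: head 1 writes only into residual dimensions 0 and 1.
rng = np.random.default_rng(0)
d_model, d_head = 8, 2
W_Q2 = rng.normal(size=(d_head, d_model))
W_V1 = rng.normal(size=(d_head, d_model))
W_O1 = np.eye(d_model)[:, :2]

W_K_aligned = np.eye(d_model)[:2, :]      # keys read dimensions 0 and 1
W_K_orthogonal = np.eye(d_model)[2:4, :]  # keys read dimensions 2 and 3

aligned = k_composition_score(W_Q2, W_K_aligned, W_O1, W_V1)
orthogonal = k_composition_score(W_Q2, W_K_orthogonal, W_O1, W_V1)
print(aligned, orthogonal)
```

When the second head's keys read from the subspace the first head writes to, the score is clearly above zero; when they read from an orthogonal subspace it is zero. Q- and V-composition follow the same pattern with the corresponding matrix products.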

Original abstract

This paper presents a mathematical framework for reverse-engineering transformer circuits by studying small attention-only models with at most two layers. We discover that transformers can be conceptualized as sums of interpretable end-to-end computational paths, and identify "induction heads" as a key mechanism for in-context learning that emerges in two-layer models. By analyzing zero, one, and two-layer attention-only transformers, we show they implement bigram statistics, skip-trigram models, and complex compositional algorithms respectively, with findings that provide foundational insights applicable to larger language models.
