claim

active

claim:induction-heads-work-by-using-k-composition-with-a-previous-token-head-to-shift-keys-by-one-token-then-matching-the-current-destination-token-against-shifted-keys-to-predict-what-follows

Induction heads work by using K-composition with a previous token head to shift keys by one token, then matching the current destination token against shifted keys to predict what follows

The mechanistic explanation of how induction heads are implemented in two-layer models
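The two-step mechanism in the claim can be sketched in plain Python. This is an idealized illustration, not the model's actual computation: real heads use soft attention over learned vectors, whereas this sketch uses exact token matching and always picks the most recent match. The function names `previous_token_head` and `induction_head_predict` are hypothetical.

```python
def previous_token_head(tokens):
    """Layer-1 sketch: each position records the token before it (None at position 0)."""
    return [None] + list(tokens[:-1])


def induction_head_predict(tokens):
    """Layer-2 sketch: match the current token against shifted keys, copy what followed."""
    # Key at position j describes tokens[j - 1], standing in for K-composition.
    shifted_keys = previous_token_head(tokens)
    query = tokens[-1]  # the current destination token
    # Assumption: hard attention to the most recent match; real attention is soft.
    for j in range(len(tokens) - 1, 0, -1):
        if shifted_keys[j] == query:
            return tokens[j]  # the token that followed the earlier occurrence
    return None  # no earlier occurrence, so there is nothing to copy


print(induction_head_predict(["A", "B", "C", "A"]))
```

Given the sequence `A B C A`, the key at the position of `B` has been shifted to carry `A`, so the query `A` matches it and the sketch copies `B` as the prediction.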

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Findings (2)

finding

All induction heads in the two-layer model occupy an extreme corner of eigenvalue positivity space, with both QK and OV eigenvalues highly positive, relative to non-induction heads
associated_with · supports
Quantitative verification of the mechanistic theory; both circuits required for the induction algorithm show the predicted copying/matching structure
Induction heads in two-layer models successfully perform in-context learning on completely random repeated token sequences far outside training distribution
associated_with
Strong test of the induction head hypothesis using uniformly sampled random tokens repeated three times
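The repeated-random-token test can be illustrated with a short sketch. It evaluates only the idealized induction rule (find the most recent earlier occurrence of the current token and predict what followed it), not a trained model. The vocabulary size, block length, and seed are hypothetical, and tokens are sampled without replacement, a simplification of the uniform sampling described above.

```python
import random


def induction_rule(prefix):
    """Predict the token that followed the most recent earlier occurrence of prefix[-1]."""
    query = prefix[-1]
    for j in range(len(prefix) - 1, 0, -1):
        if prefix[j - 1] == query:
            return prefix[j]
    return None


def repeated_sequence_accuracy(vocab_size=50000, block_len=20, repeats=3, seed=0):
    """Return (accuracy within the first block, accuracy after it) on a repeated random block."""
    rng = random.Random(seed)
    # Assumption: distinct tokens within the block, so every match is unambiguous.
    block = rng.sample(range(vocab_size), block_len)
    seq = block * repeats
    first_hits, later_hits = [], []
    for t in range(len(seq) - 1):
        correct = induction_rule(seq[: t + 1]) == seq[t + 1]
        # Predictions made from positions inside the first block have nothing to copy.
        (first_hits if t < block_len else later_hits).append(correct)
    return sum(first_hits) / len(first_hits), sum(later_hits) / len(later_hits)


first_acc, later_acc = repeated_sequence_accuracy()
print(first_acc, later_acc)
```

Under these assumptions the rule cannot predict anything during the first copy of the block and predicts every token from the second copy onward, which is the qualitative pattern the test looks for.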

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large models form many induction heads built from K-composition with a previous token head, making induction heads a central driver of in-context learning at all scales · claim · 0.852
Forward-looking claim connecting toy model findings to large-scale language models
GPT-2 implements at least one induction head using pointer arithmetic on positional embeddings rather than K-composition · hypothesis · 0.807
Observation of an alternative induction head implementation algorithm in larger models with positional embeddings in the residual stream
Induction heads explain in-context learning in small models and only develop in models with at least two attention layers · claim · 0.796
Central empirical claim of the paper; induction heads are shown to be the mechanism for powerful in-context learning
A pair of query and key subcomponents distributed across attention heads performs previous-token behavior · finding · 0.778
VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.
The Primer architecture's depthwise convolution change would allow induction heads to form without requiring K-composition · hypothesis · 0.774
Architectural interpretation of how Primer's design change relates to the paper's mechanistic theory of induction heads
All induction heads fall in an extreme corner of high OV eigenvalue positivity and high QK eigenvalue positivity, confirming the mechanistic theory · claim · 0.768
Quantitative verification that the copying and matching structure predicted by the mechanistic theory is present in all observed induction heads
In-Context Learning and Induction Heads (forthcoming paper) · concept · 0.758
A follow-up paper extending the framework and the induction head concept to larger, more realistic models
Some attention heads partially specialize in copying for words that split into two tokens without a space prefix, attending from fragmented token to complete token · finding · 0.746
Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads