claim

active

claim:attention-heads-can-be-understood-as-independent-operations-each-adding-their-output-to-the-residual-stream-equivalent-to-the-concatenate-and-multiply-formulation

Attention heads can be understood as independent operations each adding their output to the residual stream, equivalent to the concatenate-and-multiply formulation

Mathematical equivalence enabling independent analysis of each attention head
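
The equivalence can be checked numerically. The following is a minimal sketch, not code from the source paper: it assumes a row-vector convention (positions as rows, so the output matrix multiplies on the right), and all names (`head_results`, `w_o_blocks`) and dimensions are hypothetical. It compares concatenating the per-head results and multiplying by one stacked output matrix against letting each head multiply by its own block of that matrix and summing the results.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_heads, seq_len, d_head, d_model = 3, 4, 2, 5  # illustrative sizes only

# Hypothetical per-head results (seq_len x d_head), standing in for each
# head's attention-weighted value vectors.
head_results = [rand_matrix(seq_len, d_head, rng) for _ in range(n_heads)]
# The output matrix, split into one (d_head x d_model) block per head.
w_o_blocks = [rand_matrix(d_head, d_model, rng) for _ in range(n_heads)]

# Formulation 1: concatenate head results along the feature axis, then
# multiply by the stacked output matrix.
concat = [sum((r[t] for r in head_results), []) for t in range(seq_len)]
w_o = [row for block in w_o_blocks for row in block]
out_concat = matmul(concat, w_o)

# Formulation 2: each head independently multiplies by its own block and
# adds its contribution to the running sum (the residual-stream update).
out_sum = [[0.0] * d_model for _ in range(seq_len)]
for r, w in zip(head_results, w_o_blocks):
    out_sum = add(out_sum, matmul(r, w))

max_diff = max(
    abs(x - y) for ra, rb in zip(out_concat, out_sum) for x, y in zip(ra, rb)
)
print(f"max abs difference: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, because a product with a row-stacked matrix is a sum of per-block products. This is what allows each head's contribution to be studied in isolation.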

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Claims (1)

claim

Each attention head has two largely independent computations: a QK circuit computing the attention pattern and an OV circuit computing the effect if attended to
supports
Key decomposition enabling separate analysis of where attention goes and what it does
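
As an illustration of this decomposition, here is a minimal single-head sketch under simplifying assumptions that are not taken from the source: row-vector convention, no causal mask, and no 1/sqrt(d_head) scaling. All names (`w_qk`, `w_ov`, and so on) are hypothetical. It shows that the attention pattern depends on the query and key weights only through their product, and the output depends on the value and output weights only through theirs.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(1)
seq_len, d_model, d_head = 4, 6, 2  # illustrative sizes only

x = rand_matrix(seq_len, d_model, rng)    # residual stream, one row per token
w_q = rand_matrix(d_model, d_head, rng)
w_k = rand_matrix(d_model, d_head, rng)
w_v = rand_matrix(d_model, d_head, rng)
w_o = rand_matrix(d_head, d_model, rng)

# Standard computation: explicit queries, keys, and values.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
pattern = softmax_rows(matmul(q, transpose(k)))
out_standard = matmul(matmul(pattern, v), w_o)

# Circuit view: one combined QK matrix decides where to attend,
# one combined OV matrix decides what is written if attended to.
w_qk = matmul(w_q, transpose(w_k))        # d_model x d_model
w_ov = matmul(w_v, w_o)                   # d_model x d_model
pattern_qk = softmax_rows(matmul(matmul(x, w_qk), transpose(x)))
out_circuit = matmul(pattern_qk, matmul(x, w_ov))

max_diff = max(
    abs(a - b) for ra, rb in zip(out_standard, out_circuit) for a, b in zip(ra, rb)
)
print(f"max abs difference: {max_diff:.2e}")
```

Because the two combined matrices never interact except through the attention pattern, "where attention goes" and "what it does" can be analyzed separately.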

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Some attention heads partially specialize in copying for words that split into two tokens without a space prefix, attending from fragmented token to complete token · finding · 0.813
Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads
Attention is a generalization of convolution; all convolutions can be expressed as tensor products of fixed relative position attention patterns and weight matrices · claim · 0.802
Mathematical equivalence showing the relationship between attention mechanisms and convolutional operations
Attention computations distribute across heads via parameter subcomponents with interpretable roles · finding · 0.793
Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.784
Result from term importance analysis breaking down loss contribution by layer
Induction heads explain in-context learning in small models and only develop in models with at least two attention layers · claim · 0.784
Central empirical claim of the paper; induction heads are shown to be the mechanism for powerful in-context learning
Attention algorithms are usually distributed across attention heads · claim · 0.780
Claim supported by VPD's recovery of cross-head attention subcomponents, noted in footnote.
How can mechanistic interpretability methods automatically identify attention computations that span multiple attention heads? · question · 0.770
Long-standing bottleneck in mechanistic interpretability that VPD addresses by working natively on attention weight matrices.
Identification of algorithms implemented in attention layers, distributed across attention heads · finding · 0.769
VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.