Attention algorithms are usually distributed across attention heads

Claim supported by VPD's recovery of cross-head attention subcomponents, noted in footnote.

Source paper

extracted_from

cimcWhitepaper

Neighborhood — ranked by edge-count

Findings (2)

finding

A pair of query and key subcomponents distributed across attention heads performs previous-token behavior
supports
VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.
A pair of query and key subcomponents distributed across attention heads performs syntax-boundary routing
supports
VPD recovers an attention algorithm for routing across syntactic boundaries, distributed across heads.
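
One way to make "distributed across heads" concrete is to measure how much of a weight-space component falls into each head's block of the query (or key) weight matrix. The sketch below is illustrative only and is not VPD itself: it assumes a rank-one component, and it assumes the weight matrix stores heads as contiguous column blocks. `head_mass` and all shapes are hypothetical.

```python
import numpy as np

def head_mass(component, n_heads):
    """Fraction of a component's squared norm that lies in each head's column block.

    Assumes `component` has shape (d_model, n_heads * d_head) with head-major
    columns (head 0's columns first). This layout is an assumption.
    """
    d_model, d_total = component.shape
    d_head = d_total // n_heads
    blocks = component.reshape(d_model, n_heads, d_head)
    mass = (blocks ** 2).sum(axis=(0, 2))  # squared norm per head
    return mass / mass.sum()

rng = np.random.default_rng(0)
d_model, n_heads, d_head = 8, 4, 2

# Hypothetical rank-one query subcomponent u v^T touching every head's columns.
u = rng.normal(size=d_model)
v = rng.normal(size=n_heads * d_head)
component = np.outer(u, v)
shares = head_mass(component, n_heads)

# For contrast: a component confined to head 1 only.
v_single = np.zeros(n_heads * d_head)
v_single[d_head:2 * d_head] = 1.0
single_shares = head_mass(np.outer(u, v_single), n_heads)
```

Under these assumptions, a component confined to one head puts all of its mass in a single entry, while a cross-head component spreads mass over several entries. Per-head analysis would see only one slice of the latter.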

Communities (4)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic structure of transformer attention computations
members_of
Identifies distributed algorithms implemented across attention heads, with focus on causal masking limitations and emergent capabilities via activation manifold steering.
Distributed attention head decomposition
members_of
Mechanistic interpretability approach decomposing attention heads into query/key subcomponents with distinct algorithmic roles
Distributed computation across attention heads
members_of
Studies how query, key, and value components decompose into specialized subfunctions across heads, enabling routing and token prediction behaviors.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Identification of algorithms implemented in attention layers, distributed across attention heads · finding · 0.890
VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
Attention computations distribute across heads via parameter subcomponents with interpretable roles · finding · 0.834
Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
How can mechanistic interpretability methods automatically identify attention computations that span multiple attention heads? · question · 0.806
Long-standing bottleneck in mechanistic interpretability that VPD addresses by working natively on attention weight matrices.
Each attention head has two largely independent computations: a QK circuit computing the attention pattern and an OV circuit computing the effect if attended to · claim · 0.803
Key decomposition enabling separate analysis of where attention goes and what it does.
Attention is a generalization of convolution; all convolutions can be expressed as tensor products of fixed relative position attention patterns and weight matrices · claim · 0.801
Mathematical equivalence showing the relationship between attention mechanisms and convolutional operations.
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.797
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying.
attention computation · concept · 0.794
Process that uses queries (Q), keys (K), and values (V) to compute an attention pattern (a heat map over key positions) and a pattern-weighted sum of the values.
Some attention heads partially specialize in copying for words that split into two tokens without a space prefix, attending from fragmented token to complete token · finding · 0.791
Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads.
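
The attention computation and the QK/OV split described in the entries above can be sketched for a single head as follows. This is a minimal NumPy illustration under assumed shapes, not code from the source paper. The function name `attention_head`, the random weights, and the causal mask are all assumptions made for the example.

```python
import numpy as np

def attention_head(x, W_Q, W_K, W_V, W_O, causal=True):
    """Single-head attention: returns (pattern, output).

    pattern[i, j] is how strongly query position i attends to key position j.
    """
    q = x @ W_Q
    k = x @ W_K
    v = x @ W_V

    # QK circuit: decides where attention goes.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Block attention to future positions (strictly upper triangle).
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    pattern = weights / weights.sum(axis=-1, keepdims=True)

    # OV circuit: decides what is written once a position is attended to.
    out = pattern @ v @ W_O
    return pattern, out

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 4
x = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

pattern, out = attention_head(x, W_Q, W_K, W_V, W_O)
```

In this sketch the pattern depends only on `W_Q` and `W_K`, so changing `W_V` or `W_O` leaves it untouched. That is the sense in which the two circuits can be analyzed separately.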