claim
active
claim:attention-algorithms-are-usually-distributed-across-attention-heads
Attention algorithms are usually distributed across attention heads
Supported by VPD's recovery of attention subcomponents that span multiple heads; the source paper notes this in a footnote.
Source paper
extracted_from
Neighborhood (ranked by edge-count)
Findings (2)
finding
- VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.
- VPD recovers an attention algorithm for routing across syntactic boundaries, distributed across heads.
Communities (4)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Identifies distributed algorithms implemented across attention heads, with focus on causal masking limitations and emergent capabilities via activation manifold steering.
- Distributed attention head decomposition (members_of): Mechanistic interpretability approach decomposing attention heads into query/key subcomponents with distinct algorithmic roles.
- Studies how query, key, and value components decompose into specialized subfunctions across heads, enabling routing and token prediction behaviors.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Identification of algorithms implemented in attention layers, distributed across attention heads (finding, 0.890): VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
- Attention computations distribute across heads via parameter subcomponents with interpretable roles (finding, 0.834): Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
- Long-standing bottleneck in mechanistic interpretability that VPD addresses by working natively on attention weight matrices.
- Key decomposition enabling separate analysis of where attention goes and what it does
- Mathematical equivalence showing the relationship between attention mechanisms and convolutional operations
- Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
- Process using Q, K, V to compute a heat map over K and weighted sum of V.
- Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads
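
The Q/K/V process mentioned in the list above (a heat map over the keys, then a weighted sum of the values) can be illustrated with a minimal single-head sketch. This is standard scaled dot-product attention, not VPD and not code from the source paper; the function names, the 1/sqrt(d) scaling, and the optional causal mask are assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=True):
    """Single-head scaled dot-product attention on lists of vectors.

    Returns (weights, outputs): weights[i][j] is how much position i
    attends to position j (the "heat map"), and outputs[i] is the
    weighted sum of the value vectors under those weights.
    """
    d = len(keys[0])
    weights, outputs = [], []
    for i, q in enumerate(queries):
        # Assumption: with a causal mask, position i sees positions 0..i only.
        limit = i + 1 if causal else len(keys)
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in keys[:limit]
        ]
        # Masked positions receive exactly zero weight.
        row = softmax(scores) + [0.0] * (len(keys) - limit)
        weights.append(row)
        outputs.append([
            sum(w * v[c] for w, v in zip(row, values))
            for c in range(len(values[0]))
        ])
    return weights, outputs

# Toy example: two positions, two-dimensional vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
weights, outputs = attention(queries, keys, values)
```

In this toy case the first position can only attend to itself, so its output equals its own value vector. A multi-head layer runs several such maps in parallel, which is why an algorithm such as previous-token attention can end up spread across heads rather than living in one.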