finding
active
finding:a-pair-of-query-and-key-subcomponents-distributed-across-attention-heads-performs-previous-token-behavior
A pair of query and key subcomponents distributed across attention heads performs previous-token behavior
VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.
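As a rough illustration of what "attending to the previous token" means for an attention pattern, the sketch below scores a causal pattern by the mean weight each query position places on the position directly before it. This is a generic diagnostic, not VPD's procedure; the function names `causal_softmax` and `previous_token_score` and the toy score matrix are assumptions made for the example.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax restricted to key positions j <= i (causal mask)."""
    n = scores.shape[-1]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def previous_token_score(pattern):
    """Mean attention weight that query position i places on key position i-1."""
    # The sub-diagonal (offset -1) holds pattern[i, i-1] for i = 1..n-1.
    return float(np.diagonal(pattern, offset=-1).mean())

# Hypothetical toy scores that strongly favour the key one position back.
n = 6
scores = np.zeros((n, n))
idx = np.arange(1, n)
scores[idx, idx - 1] = 8.0

pattern = causal_softmax(scores)
score = previous_token_score(pattern)

# Baseline: uniform scores spread attention evenly over the visible positions.
baseline = previous_token_score(causal_softmax(np.zeros((n, n))))
```

A score near 1 indicates a sharp previous-token pattern, while the uniform baseline stays much lower. In the distributed setting this finding describes, the scores would come from summing contributions across heads rather than from a single head, which this sketch does not model.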
Source paper
extracted_from
Neighborhood, ranked by edge-count
Claims (1)
claim
- Claim supported by VPD's recovery of cross-head attention subcomponents, noted in footnote.
Communities (4)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Identifies distributed algorithms implemented across attention heads, with focus on causal masking limitations and emergent capabilities via activation manifold steering.
- Distributed attention head decomposition · members_of · Mechanistic interpretability approach decomposing attention heads into query/key subcomponents with distinct algorithmic roles
- Studies how query, key, and value components decompose into specialized subfunctions across heads, enabling routing and token prediction behaviors.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- An attention algorithm recovered by VPD where the model attends to the immediately preceding token.
- A pair of query and key subcomponents distributed across attention heads performs syntax-boundary routing · finding · 0.831 · VPD recovers an attention algorithm for routing across syntactic boundaries, distributed across heads.
- Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads
- Attention computations distribute across heads via parameter subcomponents with interpretable roles · finding · 0.806 · Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
- The mechanistic explanation of how induction heads are implemented in two-layer models
- An attention head that primarily attends to the immediately preceding token; key building block for induction heads via K-composition
- Reframing observation: the canonical K/Q/V decomposition is computationally convenient but not the most interpretable representation
- Describes the properties of the functional token.