community

active

leiden_hybrid_concepts

label: haiku

community:leiden_hybrid_concepts-run4-c0-c0-c2

Distributed computation across attention heads

Studies how query, key, and value components decompose into specialized subfunctions across heads, enabling routing and token prediction behaviors.

4 members.

Drawn from 2 sources

The papers/notes whose extracted claims & findings make up this cluster.

Paper Summary: Interpreting Language Model Parameters (3 members)
Janus Information Flow Transformers 2025 (1 member)

Bridges (3)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation (4 shared)
Mechanistic structure of transformer attention computations (4 shared)
Distributed attention head decomposition (3 shared)

Claims (2)

Attention algorithms are usually distributed across attention heads
Claim supported by VPD's recovery of cross-head attention subcomponents, noted in footnote.

Q/K/V values function as information routing: Q queries past, K signals future attention, V carries selectively routed information.
Janus's interpretive model for how attention mechanisms enable deliberate information flow and selective routing.
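
The routing reading of Q, K, and V in the second claim can be made concrete with a minimal sketch of one causal attention head. This is standard scaled dot-product attention written in NumPy, not code from either source; the function name `causal_attention`, the random weights, and the toy dimensions are all assumptions chosen for illustration.

```python
import numpy as np

def causal_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention; returns (output, weights)."""
    q = x @ w_q  # queries: what each position looks for in earlier positions
    k = x @ w_k  # keys: what each position advertises to later queries
    v = x @ w_v  # values: the content moved when a position is attended to
    seq_len, d_head = q.shape
    scores = (q @ k.T) / np.sqrt(d_head)
    # Causal mask: position i may only attend to positions j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy inputs (hypothetical sizes, random weights).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = causal_attention(x, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the current and earlier positions only, so the output at a position is a mixture of value vectors selected by how well that position's query matches earlier keys.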

Findings (2)

A pair of query and key subcomponents distributed across attention heads performs previous-token behavior
VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.

A pair of query and key subcomponents distributed across attention heads performs syntax-boundary routing
VPD recovers an attention algorithm for routing across syntactic boundaries, distributed across heads.
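
As a rough illustration of what previous-token behavior means for a query/key pair, the toy below hand-builds weights so that each position attends to the one before it. It is a sketch under strong assumptions: one-hot positional embeddings, a single head, and a hypothetical sharpness factor `beta`. It does not reproduce VPD or the cross-head distribution described in the findings.

```python
import numpy as np

seq_len = 6
pos = np.eye(seq_len)        # one-hot positional embeddings (toy assumption)
w_q = np.eye(seq_len)        # query at position i is e_i
w_k = np.eye(seq_len, k=1)   # key at position j is e_{j+1}
beta = 20.0                  # hypothetical sharpness factor for the softmax

q = pos @ w_q
k = pos @ w_k
# scores[i, j] equals beta when j == i - 1 and 0 otherwise.
scores = beta * (q @ k.T)

# Causal mask: position i may only attend to positions j <= i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Row-wise softmax.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

# Most-attended position for each query position.
attended = weights.argmax(axis=-1)
```

With these weights, every position after the first puts nearly all of its attention on its immediate predecessor; the first position has no predecessor and can only attend to itself.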