claim
active
claim:most-attention-heads-in-one-layer-models-dedicate-an-enormous-fraction-of-their-capacity-to-copying-behavior

Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior

Empirical observation from examining the expanded OV and QK matrices: approximately 10 of the 12 heads show significant copying.
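A minimal sketch of how such an inspection could look, assuming the expanded OV matrix is formed as W_U W_O W_V W_E and that copying shows up as predominantly positive eigenvalues. The weights below are random toy matrices, not the model from the source paper, and the names `expanded_ov` and `copying_score` are hypothetical helpers introduced only for illustration. The score used here (sum of eigenvalues divided by sum of their magnitudes) is one possible summary statistic, close to +1 when a head mostly copies and close to -1 when it mostly suppresses the attended token.

```python
import numpy as np

def expanded_ov(W_U, W_O, W_V, W_E):
    """Token-to-logit map of one head, shape (n_vocab, n_vocab)."""
    return W_U @ W_O @ W_V @ W_E

def copying_score(M):
    """Sum of eigenvalues over sum of their magnitudes, in [-1, 1]."""
    eig = np.linalg.eigvals(M)
    return float(np.real(eig.sum()) / np.abs(eig).sum())

# Toy dimensions (assumptions, chosen small so the example runs quickly).
n_vocab, d_model, d_head = 50, 16, 4
rng = np.random.default_rng(0)

W_E = rng.normal(size=(d_model, n_vocab))  # embedding
W_U = W_E.T                                # tied unembedding (toy assumption)
A = rng.normal(size=(d_model, d_head))

# Copying head: W_O @ W_V = A @ A.T is positive semidefinite,
# so every nonzero eigenvalue of the expanded matrix is positive.
score_copy = copying_score(expanded_ov(W_U, A, A.T, W_E))

# Anti-copying head: flip the sign of the output projection.
score_anti = copying_score(expanded_ov(W_U, -A, A.T, W_E))

# Unstructured head: independent random value and output projections.
W_V_rand = rng.normal(size=(d_head, d_model))
W_O_rand = rng.normal(size=(d_model, d_head))
score_rand = copying_score(expanded_ov(W_U, W_O_rand, W_V_rand, W_E))

print(score_copy, score_anti, score_rand)
```

Applying a score like this to each head of a trained one-layer model and counting how many land near +1 is one way to arrive at a tally such as "10 of 12 heads"; the threshold for "significant" is a judgment call that this sketch does not settle.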

Source paper

extracted_from
A Mathematical Framework for Transformer Circuits (2021)
