finding

active

finding:in-the-analyzed-two-layer-attention-only-model-only-k-composition-is-significant-v-and-q-composition-are-negligible-by-frobenius-norm-measure

In the analyzed two-layer attention-only model, only K-composition is significant; V- and Q-composition are negligible by Frobenius norm measure

Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
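A minimal sketch of how a Frobenius norm composition measurement of this kind can be computed for one pair of heads, assuming the convention W_OV = W_O W_V and W_QK = W_Q^T W_K, with Q-, K-, and V-composition scored as the Frobenius norm of the product W_QK^T W_OV, W_QK W_OV, and W_OV W_OV respectively, divided by the product of the two factors' norms. The function names, argument names, and weight shapes below are illustrative assumptions, not code from the source paper. The sketch omits any baseline correction (such as subtracting the score expected for random matrices of the same shape).

```python
import numpy as np


def composition_score(w_second, w_first):
    """Return ||w_second @ w_first||_F / (||w_second||_F * ||w_first||_F)."""
    # np.linalg.norm defaults to the Frobenius norm for 2-D arrays.
    product_norm = np.linalg.norm(w_second @ w_first)
    return product_norm / (np.linalg.norm(w_second) * np.linalg.norm(w_first))


def head_pair_scores(w_q2, w_k2, w_v2, w_o2, w_v1, w_o1):
    """Q-, K-, and V-composition scores for one (first-layer, second-layer) pair.

    Assumed shapes (hypothetical layout): w_q, w_k, w_v are (d_head, d_model);
    w_o is (d_model, d_head).
    """
    w_ov1 = w_o1 @ w_v1      # first-layer OV circuit, (d_model, d_model)
    w_ov2 = w_o2 @ w_v2      # second-layer OV circuit
    w_qk2 = w_q2.T @ w_k2    # second-layer QK circuit, query side on the left
    return {
        "Q": composition_score(w_qk2.T, w_ov1),  # query reads first head's output
        "K": composition_score(w_qk2, w_ov1),    # key reads first head's output
        "V": composition_score(w_ov2, w_ov1),    # value reads first head's output
    }


if __name__ == "__main__":
    # Random weights stand in for a trained model; sizes are arbitrary.
    rng = np.random.default_rng(0)
    d_model, d_head = 64, 8
    down = lambda: rng.normal(size=(d_head, d_model))
    up = lambda: rng.normal(size=(d_model, d_head))
    print(head_pair_scores(down(), down(), down(), up(), down(), up()))
```

Because the Frobenius norm is submultiplicative, each score lies between 0 and 1. Running `head_pair_scores` over every first-layer and second-layer head pair gives one grid of scores per composition type, which is the kind of comparison this finding summarizes.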

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Claims (1)

claim

In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads
supports
Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Second-order virtual attention head terms (V-composition) have a small marginal effect in two-layer attention-only models · claim · 0.846
Finding from term importance analysis; allows focus on individual head terms rather than their compositions
Naive interpretation of attention patterns can be both informative and fundamentally misleading when Q-, K-, or V-composition is present · claim · 0.823
Response to the 'attention as explanation' critique; the paper provides a typology of when attention is and isn't directly interpretable
Second-order virtual attention head terms contribute negligible marginal loss reduction in the analyzed two-layer attention-only model · finding · 0.815
Result of term importance analysis ablation experiment; justifies focusing on individual head terms
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.808
Result from term importance analysis breaking down loss contribution by layer
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.778
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.774
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
10 out of 12 attention heads in the 12-head one-layer model show significantly positive eigenvalue sums, indicating copying behavior · finding · 0.768
Quantitative result from eigenvalue analysis of expanded OV matrices; confirmed by qualitative inspection
We revealed the one-layer attention-only model to be a compressed Chinese room, and we're left with a giant pile of cards. · quote · 0.766
Vivid characterization of the limits of understanding after converting to skip-trigram form: no algorithmic mystery remains but the sheer scale prevents holistic comprehension