claim

active

claim:in-small-two-layer-attention-only-transformers-the-only-significant-composition-is-k-composition-between-a-single-first-layer-head-and-some-second-layer-heads

In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads

Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits (2021)

Neighborhood — ranked by edge-count

Findings (1)

finding

In the analyzed two-layer attention-only model, only K-composition is significant; V- and Q-composition are negligible by Frobenius norm measure
supports
Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
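
The Frobenius norm composition measurement mentioned above can be sketched roughly as follows. This is a minimal illustration, not the source paper's code: it assumes the score for a head pair is the Frobenius norm of the product of the second-layer circuit matrix and the first-layer OV matrix, divided by the product of their individual norms, and it applies no baseline correction for what random matrices would score. The function names, weight shapes, and random weights are hypothetical and chosen only for illustration.

```python
import numpy as np


def composition_score(w_second, w_ov_first):
    """Return ||w_second @ w_ov_first||_F / (||w_second||_F * ||w_ov_first||_F)."""
    product = w_second @ w_ov_first
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    denom = np.linalg.norm(w_second) * np.linalg.norm(w_ov_first)
    return float(np.linalg.norm(product) / denom)


def head_circuits(w_q, w_k, w_v, w_o):
    """Build the d_model x d_model QK and OV matrices for one head.

    Assumed shapes: w_q, w_k, w_v are (d_head, d_model); w_o is (d_model, d_head).
    """
    w_qk = w_q.T @ w_k  # query side on the left, key side on the right
    w_ov = w_o @ w_v    # reads via w_v, writes back to the residual stream via w_o
    return w_qk, w_ov


def composition_scores(first_head, second_head):
    """Q-, K-, and V-composition scores for one (first-layer, second-layer) head pair."""
    _, w_ov_1 = head_circuits(*first_head)
    w_qk_2, w_ov_2 = head_circuits(*second_head)
    return {
        "Q": composition_score(w_qk_2.T, w_ov_1),  # first head feeds the query side
        "K": composition_score(w_qk_2, w_ov_1),    # first head feeds the key side
        "V": composition_score(w_ov_2, w_ov_1),    # first head feeds the value side
    }


def random_head(rng, d_model=16, d_head=4):
    """Hypothetical random weights standing in for a trained head."""
    w_q = rng.normal(size=(d_head, d_model))
    w_k = rng.normal(size=(d_head, d_model))
    w_v = rng.normal(size=(d_head, d_model))
    w_o = rng.normal(size=(d_model, d_head))
    return w_q, w_k, w_v, w_o


rng = np.random.default_rng(0)
scores = composition_scores(random_head(rng), random_head(rng))
print(scores)
```

Each score lies between 0 and 1 because the Frobenius norm is submultiplicative. Running the function over every first-layer and second-layer head pair in a trained model would give the three score tables that a finding like this one summarizes.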

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.850
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
Do we 'fully understand' one-layer attention-only transformers? · question · 0.814
The paper explicitly asks and addresses this question, concluding the answer depends on what 'fully understand' means
Virtual attention heads (V-composition) may be much more important in larger and more complex transformers than in two-layer toy models · hypothesis · 0.800
Forward-looking speculation based on the theoretical elegance and combinatorial growth of virtual head count with depth
One-Layer Attention-Only Transformer · concept · 0.787
The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
Two-Layer Attention-Only Transformer · concept · 0.777
The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
One-layer attention-only transformers are an ensemble of bigram and skip-trigram models whose parameters can be read directly from weights · claim · 0.773
Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
Second-order virtual attention head terms (V-composition) have a small marginal effect in two-layer attention-only models · claim · 0.771
Finding from term importance analysis; allows focus on individual head terms rather than their compositions
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.755
Result from term importance analysis breaking down loss contribution by layer