claim
active
claim:in-small-two-layer-attention-only-transformers-the-only-significant-composition-is-k-composition-between-a-single-first-layer-head-and-some-second-layer-heads
In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads
Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
Neighborhood — ranked by edge-count
Findings (1)
finding
- Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
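The Frobenius norm composition measurement mentioned in this finding can be sketched roughly as follows. This is a minimal illustration, not the analyzed model's code: it assumes each head is summarized by a `d_model x d_model` matrix `W_QK` (query side on the left index, key side on the right) and a `d_model x d_model` matrix `W_OV` in the column-vector convention. The function names `composition_score` and `head_pair_scores` and the random weights in the usage snippet are hypothetical placeholders.

```python
import numpy as np


def composition_score(reader, writer):
    """Frobenius-norm ratio ||reader @ writer||_F / (||reader||_F * ||writer||_F).

    Lies in [0, 1]; it is 0 when `reader` ignores everything `writer` outputs.
    """
    num = np.linalg.norm(reader @ writer, ord="fro")
    den = np.linalg.norm(reader, ord="fro") * np.linalg.norm(writer, ord="fro")
    return float(num / den)


def head_pair_scores(w_qk_2, w_ov_2, w_ov_1):
    """Q-, K- and V-composition scores for one (layer-1 head, layer-2 head) pair.

    Assumed convention: attention score = x_query^T @ W_QK @ x_key,
    and a head writes W_OV @ x into the residual stream.
    """
    return {
        "Q": composition_score(w_qk_2.T, w_ov_1),  # layer-2 query reads layer-1 output
        "K": composition_score(w_qk_2, w_ov_1),    # layer-2 key reads layer-1 output
        "V": composition_score(w_ov_2, w_ov_1),    # layer-2 value reads layer-1 output
    }


if __name__ == "__main__":
    # Placeholder weights only; a real analysis would load trained parameters.
    rng = np.random.default_rng(0)
    d_model, d_head = 8, 2
    w_q, w_k, w_v = (rng.normal(size=(d_head, d_model)) for _ in range(3))
    w_o = rng.normal(size=(d_model, d_head))
    w_qk_2 = w_q.T @ w_k   # low-rank QK circuit of a layer-2 head
    w_ov_2 = w_o @ w_v     # low-rank OV circuit of a layer-2 head
    w_ov_1 = rng.normal(size=(d_model, d_head)) @ rng.normal(size=(d_head, d_model))
    print(head_pair_scores(w_qk_2, w_ov_2, w_ov_1))
```

Running `head_pair_scores` over every first-layer/second-layer head pair gives one table per composition type; a claim like the one on this page would correspond to the K table having a few large entries concentrated on a single first-layer head while the Q and V tables stay near the level seen for unrelated matrices.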
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
- The paper explicitly asks and addresses this question, concluding the answer depends on what 'fully understand' means
- Virtual attention heads (V-composition) may be much more important in larger and more complex transformers than in two-layer toy models · hypothesis · 0.800 · Forward-looking speculation based on the theoretical elegance and combinatorial growth of virtual head count with depth
- The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
- The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
- Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
- Finding from term importance analysis; allows focus on individual head terms rather than their compositions
- Result from term importance analysis breaking down loss contribution by layer