claim

active

claim:two-layer-attention-only-transformers-implement-much-more-complex-algorithms-via-composition-of-attention-heads-detectable-directly-from-weights

Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights

Core claim for two-layer models; composition creates qualitatively more powerful in-context learning

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Findings (1)

finding

Induction heads in two-layer models successfully perform in-context learning on completely random repeated token sequences far outside the training distribution
supports
Strong test of the induction head hypothesis using uniformly sampled random tokens repeated three times
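
The test described in this finding can be sketched in a few lines. This is a minimal illustration under stated assumptions: the vocabulary size and block length are hypothetical (neither is given here), and `make_repeated_sequence` and `induction_predictions` are illustrative names, not taken from the paper. The second function mimics the induction rule itself (find the previous occurrence of the current token and predict whatever followed it) rather than running a trained model.

```python
import random


def make_repeated_sequence(vocab_size, block_len, repeats=3, seed=0):
    """Sample block_len token ids uniformly at random, then repeat the block."""
    rng = random.Random(seed)
    block = [rng.randrange(vocab_size) for _ in range(block_len)]
    return block * repeats


def induction_predictions(tokens):
    """For each position, predict the next token by copying whatever followed
    the most recent earlier occurrence of the current token (None if unseen)."""
    last_seen = {}  # token id -> index of its most recent occurrence
    preds = []
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        preds.append(tokens[j + 1] if j is not None else None)
        last_seen[tok] = i
    return preds


# Hypothetical sizes chosen for illustration only.
tokens = make_repeated_sequence(vocab_size=50_000, block_len=20)
preds = induction_predictions(tokens)

# Compare each prediction with the true next token (the last position has none).
hits = [p == t for p, t in zip(preds[:-1], tokens[1:])]
```

When the tokens within a block are distinct, `hits` is false throughout the first copy, where nothing has been seen before, and true from the second copy onward. That is the pattern an induction head is expected to show on such sequences, even though random tokens carry no statistical structure for the model to fall back on.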

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Do we 'fully understand' one-layer attention-only transformers? · question · 0.869
The paper explicitly asks and addresses this question, concluding the answer depends on what 'fully understand' means
Two-Layer Attention-Only Transformer · concept · 0.854
The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
One-layer attention-only transformers are an ensemble of bigram and skip-trigram models whose parameters can be read directly from weights · claim · 0.852
Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads · claim · 0.850
Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
One-Layer Attention-Only Transformer · concept · 0.844
The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.785
Result from term importance analysis breaking down loss contribution by layer
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.783
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
Identification of algorithms implemented in attention layers, distributed across attention heads · finding · 0.782
VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
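
Several entries above describe composition, and K-composition in particular, as something that can be read from the weights alone. A rough sketch of one such measure follows. It assumes the Frobenius-norm ratio form of the composition score and the convention W_QK = W_Q^T W_K, with a previous-layer head's W_OV = W_O W_V. The helper names, the toy dimensions, and the random matrices are all hypothetical. Plain Python lists are used only to keep the sketch dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius(a):
    return math.sqrt(sum(x * x for row in a for x in row))


def composition_score(w_reader, w_ov_prev):
    """Frobenius norm of the product, normalized by the norms of the factors."""
    product = matmul(w_reader, w_ov_prev)
    return frobenius(product) / (frobenius(w_reader) * frobenius(w_ov_prev))


def k_composition(w_q, w_k, w_ov_prev):
    # Assumed convention: W_QK = W_Q^T W_K, shape (d_model, d_model), so the
    # key side reads the previous head's output through W_QK @ W_OV.
    w_qk = matmul(transpose(w_q), w_k)
    return composition_score(w_qk, w_ov_prev)


# Toy random weights (hypothetical sizes, not from any trained model).
rng = random.Random(0)
d_model, d_head = 8, 2


def rand(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


w_q, w_k = rand(d_head, d_model), rand(d_head, d_model)
w_ov_prev = matmul(rand(d_model, d_head), rand(d_head, d_model))
score = k_composition(w_q, w_k, w_ov_prev)
```

The score lies between 0 and 1 because the Frobenius norm of a product is at most the product of the norms. A value near 0 means the second-layer head's keys ignore the subspace the first-layer head writes to. Under the same assumed convention, replacing `w_qk` with its transpose would give the analogous Q-composition measure.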