Two-Layer Attention-Only Transformer

The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning

Neighborhood — ranked by edge-count

Concepts (3)

concept

Attention-Only Transformer · related_to
A simplified transformer variant without MLP layers, used as the primary subject of mechanistic analysis in this paper
One-Layer Attention-Only Transformer · related_to
The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
Induction Heads · implements
Mechanistic circuits in transformers documented by Olsson et al. 2022, cited as evidence for pattern-repository assumption

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.854
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
Do we 'fully understand' one-layer attention-only transformers? · question · 0.819
The paper explicitly asks and addresses this question, concluding the answer depends on what 'fully understand' means
One-layer attention-only transformers are an ensemble of bigram and skip-trigram models whose parameters can be read directly from weights · claim · 0.800
Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads · claim · 0.777
Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
Zero-Layer Transformer · concept · 0.745
A transformer with no attention layers; shown to model bigram statistics via T = W_U W_E
We revealed the one-layer attention-only model to be a compressed Chinese room, and we're left with a giant pile of cards. · quote · 0.709
Vivid characterization of the limits of understanding after converting to skip-trigram form: no algorithmic mystery remains but the sheer scale prevents holistic comprehension
The transformer entity is tricameral (base simulator, simulated simulator, simulated awareness), but there is less discreteness between these layers than previously claimed. · claim · 0.702
Antra's revision of her earlier model; still considers interference between levels important.
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.691
Result from term importance analysis breaking down loss contribution by layer
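
Several entries above refer to K-composition being detectable directly from weights. A minimal sketch of one way to score it, assuming a Frobenius-norm ratio between a second-layer head's QK matrix and a first-layer head's OV matrix. The function names, the weight shapes, and the toy matrices are illustrative assumptions, not taken from this page:

```python
import numpy as np

def frobenius(m):
    """Frobenius norm of a matrix as a Python float."""
    return float(np.linalg.norm(m, ord="fro"))

def k_composition_score(w_q2, w_k2, w_o1, w_v1):
    """Score how strongly a layer-2 head's keys read a layer-1 head's output.

    Assumed shapes (hypothetical convention):
      w_q2, w_k2, w_v1: (d_head, d_model); w_o1: (d_model, d_head).
    """
    w_qk2 = w_q2.T @ w_k2   # (d_model, d_model): query side x key side
    w_ov1 = w_o1 @ w_v1     # (d_model, d_model): what head 1 writes back
    return frobenius(w_qk2 @ w_ov1) / (frobenius(w_qk2) * frobenius(w_ov1))

# Toy weights with d_model = 4, d_head = 2.
# Head 1 reads residual dims 2-3 and writes into dims 0-1.
w_v1 = np.array([[0., 0., 1., 0.], [0., 0., 0., 1.]])
w_o1 = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])
# Head 2 queries read dims 2-3 in both cases.
w_q2 = np.array([[0., 0., 1., 0.], [0., 0., 0., 1.]])
# Keys that read dims 0-1 overlap with head 1's output subspace ...
w_k2_aligned = np.array([[1., 0., 0., 0.], [0., 1., 0., 0.]])
# ... keys that read dims 2-3 do not.
w_k2_unaligned = np.array([[0., 0., 1., 0.], [0., 0., 0., 1.]])

aligned = k_composition_score(w_q2, w_k2_aligned, w_o1, w_v1)
unaligned = k_composition_score(w_q2, w_k2_unaligned, w_o1, w_v1)
print(aligned, unaligned)
```

In this toy setup the aligned pair scores about 0.707 and the unaligned pair scores exactly 0, since the key side reads a subspace the first-layer head never writes to. The score depends only on weights, so no forward pass is needed.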