One-Layer Attention-Only Transformer

The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights

Neighborhood — ranked by edge-count

Concepts (3)

concept

Attention-Only Transformer
related_to
A simplified transformer variant without MLP layers, used as the primary subject of mechanistic analysis in this paper
Two-Layer Attention-Only Transformer
related_to
The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
Skip-Trigram
implements
A three-token pattern of the form [source]...[destination][out] that one-layer attention heads implement; the paper's key characterization of one-layer transformer behavior
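
As a minimal sketch of how a skip-trigram entry might be read from weights without running the model: the attention score between a destination and a source token can be looked up in a vocab-by-vocab QK table, and the effect of the attended source token on each output logit in a vocab-by-vocab OV table. The matrix names (W_E, W_U, W_Q, W_K, W_V, W_O), the column-vector convention, the toy sizes, and the random weights below are all assumptions made for illustration; they are not taken from this page.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes and random weights (assumption: one attention head).
rng = random.Random(0)
n_vocab, d_model, d_head = 5, 4, 2

W_E = rand_matrix(d_model, n_vocab, rng)  # embedding:   d_model x vocab
W_U = rand_matrix(n_vocab, d_model, rng)  # unembedding: vocab x d_model
W_Q = rand_matrix(d_head, d_model, rng)
W_K = rand_matrix(d_head, d_model, rng)
W_V = rand_matrix(d_head, d_model, rng)
W_O = rand_matrix(d_model, d_head, rng)

# QK table: entry [dst][src] is the pre-softmax attention score when the
# destination token attends to the source token.
qk_circuit = matmul(matmul(transpose(W_E), transpose(W_Q)), matmul(W_K, W_E))

# OV table: entry [out][src] is the change in the logit of `out` caused by
# attending to the source token.
ov_circuit = matmul(matmul(W_U, W_O), matmul(W_V, W_E))

def skip_trigram(src, dst, out):
    """Return (attention score, logit effect) for [src]...[dst][out]."""
    return qk_circuit[dst][src], ov_circuit[out][src]

score, effect = skip_trigram(src=1, dst=2, out=3)
print(f"attention score {score:.3f}, logit effect {effect:.3f}")
```

Both tables depend only on the weights, so every skip-trigram can be enumerated by indexing them. A skip-trigram is strongly implemented when both numbers are large and positive.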

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Do we 'fully understand' one-layer attention-only transformers? · question · 0.866
The paper explicitly asks and addresses this question, concluding the answer depends on what 'fully understand' means
Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.844
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
One-layer attention-only transformers are an ensemble of bigram and skip-trigram models whose parameters can be read directly from weights · claim · 0.836
Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads · claim · 0.787
Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
Zero-Layer Transformer · concept · 0.767
A transformer with no attention layers; shown to model bigram statistics via T = W_U W_E
We revealed the one-layer attention-only model to be a compressed Chinese room, and we're left with a giant pile of cards. · quote · 0.742
Vivid characterization of the limits of understanding after converting to skip-trigram form: no algorithmic mystery remains but the sheer scale prevents holistic comprehension
The last layer of the transformer has the largest projection magnitude on the reflection direction, likely because it directly controls generation of reflection keywords · claim · 0.725
Interpretive claim from attention head attribution analysis in appendix
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.718
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
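
The Zero-Layer Transformer entry above states that bigram statistics are modeled via T = W_U W_E. The sketch below illustrates that product on a three-token vocabulary. The weights are hand-picked, hypothetical values, and the shapes (W_E as d_model by vocab, W_U as vocab by d_model) are an assumed convention rather than something stated on this page.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical toy weights: 3-token vocabulary, d_model = 2.
W_E = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]   # embedding:   d_model x vocab
W_U = [[0.0, 2.0],
       [2.0, 0.0],
       [1.0, 1.0]]        # unembedding: vocab x d_model

# T[next][current] is the logit of `next` given only the current token.
T = matmul(W_U, W_E)

def next_token_probs(current):
    """Softmax over the column of T selected by the current token."""
    logits = [T[nxt][current] for nxt in range(len(T))]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

for token in range(3):
    print(token, [round(p, 3) for p in next_token_probs(token)])
```

With no attention layers, each prediction depends only on the current token, which is why the table T behaves like a bigram model.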