Attention-Only Transformer

A simplified transformer variant without MLP layers, used as the primary subject of mechanistic analysis in this paper

Neighborhood — ranked by edge-count

concept

One-Layer Attention-Only Transformer
related_to
The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
Two-Layer Attention-Only Transformer
related_to
The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
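
As a minimal sketch of what "readable directly from weights" can mean for the one-layer model, the NumPy snippet below builds vocabulary-by-vocabulary tables from weight products alone, without running a forward pass. The weights are random stand-ins for a trained model. The names (`W_E`, `W_U`, `W_Q`, `W_K`, `W_V`, `W_O`, `skip_trigram_terms`) and the sizes are assumptions for illustration. Positional embeddings and layer normalization are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, d_head = 50, 16, 4  # hypothetical toy sizes

# Random stand-ins for trained weights (column-vector convention).
W_E = rng.normal(size=(d_model, n_vocab))  # token embedding
W_U = rng.normal(size=(n_vocab, d_model))  # unembedding
W_Q = rng.normal(size=(d_head, d_model))   # query projection (one head)
W_K = rng.normal(size=(d_head, d_model))   # key projection
W_V = rng.normal(size=(d_head, d_model))   # value projection
W_O = rng.normal(size=(d_model, d_head))   # output projection

# Direct path (no attention): a bigram-like table,
# indexed as [output token, current token].
bigram_table = W_U @ W_E

# QK circuit: how strongly a destination token attends to a source token,
# indexed as [destination token, source token].
qk_circuit = W_E.T @ W_Q.T @ W_K @ W_E

# OV circuit: how an attended-to source token shifts the output logits,
# indexed as [output token, source token].
ov_circuit = W_U @ W_O @ W_V @ W_E


def skip_trigram_terms(source, destination, output):
    """Return (attention-score term, logit-effect term) for a skip-trigram
    of the form [source] ... [destination] -> [output]."""
    return qk_circuit[destination, source], ov_circuit[output, source]
```

Each table has shape `(n_vocab, n_vocab)`, and the two circuit matrices have rank at most `d_head`, so under these assumptions a head's skip-trigram behavior can be inspected by indexing into the tables rather than by sampling from the model.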

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.766
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
Do we 'fully understand' one-layer attention-only transformers? · question · 0.749
The paper explicitly asks and addresses this question, concluding the answer depends on what 'fully understand' means
Self-attention · concept · 0.746
A form of key-query attention within a single input sequence; core to Transformers.
One-layer attention-only transformers are an ensemble of bigram and skip-trigram models whose parameters can be read directly from weights · claim · 0.742
Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
attention mechanism · concept · 0.722
Core operation in transformers, computing weighted combinations of previous elements
Attention heads · concept · 0.722
Transformer attention heads that could be recruited to extract different kinds of information (text vs. thoughts).
Signal Transformer · concept · 0.721
Core abstraction in Fruit: pure function mapping signals to signals; enables compositional GUI definitions.
Attention is All You Need (Vaswani et al., 2017) · concept · 0.714
Original transformer paper; foundational reference cited throughout for the architecture being analyzed.
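
The "attention mechanism" entry above describes attention as computing weighted combinations of previous elements. A minimal single-head sketch of that idea in NumPy follows. The function name `causal_attention`, the toy shapes, and the random inputs are assumptions for illustration, and "previous" here includes the current position.

```python
import numpy as np


def causal_attention(x, W_Q, W_K, W_V):
    """Single-head causal self-attention.

    x has shape (seq_len, d_model); each weight has shape (d_model, d_head)
    (row-vector convention). Returns (outputs, attention weights).
    """
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Mask future positions so each token sees only itself and earlier tokens.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax (subtracting the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted combination of value vectors.
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # hypothetical: 5 tokens, d_model = 8
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = causal_attention(x, W_Q, W_K, W_V)
```

Each row of `weights` sums to 1 and is zero above the diagonal, so every output position mixes value vectors only from its own and earlier positions.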