Attention-Only Transformer
concept:attention-only-transformer · concept · active
A simplified transformer variant without MLP layers, used as the primary subject of mechanistic analysis in this paper
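As a rough illustration of what "without MLP layers" means, the sketch below builds one attention-only layer in plain Python: each head mixes information from the current and earlier positions, and the head outputs are added back onto the residual stream with no MLP in between. The function names (`causal_attention_head`, `attention_only_layer`), the weight layout, and the omission of layer normalization, biases, and positional information are all assumptions made for this example; it is not the paper's implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_head(x, w_q, w_k, w_v, w_o):
    """One attention head (hypothetical helper); position i sees positions <= i."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(w_q[0])
    mixed = []
    for i in range(len(x)):
        # Scaled dot-product scores against the current and earlier positions only.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_head)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted combination of the value vectors at those positions.
        mixed.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(d_head)]
        )
    return matmul(mixed, w_o)

def attention_only_layer(x, heads):
    """Residual stream plus the sum of all head outputs; there is no MLP step."""
    out = [row[:] for row in x]
    for head in heads:
        head_out = causal_attention_head(x, *head)
        out = [[a + b for a, b in zip(r, h)] for r, h in zip(out, head_out)]
    return out

# Toy input: 3 positions, d_model = 2, one head with d_head = 1.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head = ([[1.0], [0.0]], [[0.0], [1.0]], [[1.0], [0.0]], [[0.0, 1.0]])
y = attention_only_layer(x, [head])
```

Because the only nonlinearity left is the attention softmax, a layer like this is a sum of separate paths through the weights, which is what makes the simplified variant convenient to analyze.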
Neighborhood — ranked by edge-count
Concepts (2)
- One-Layer Attention-Only Transformer (related_to): The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
- Two-Layer Attention-Only Transformer (related_to): The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
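One way to see why a one-layer attention-only model can be read off as bigram and skip-trigram tables is to expand it into paths through the weights. The notation below is an assumption taken from common mechanistic-interpretability convention and is not stated in this entry: $W_E$ is the token embedding, $W_U$ the unembedding, $W_{QK}^h = W_Q^{h\top} W_K^h$ and $W_{OV}^h = W_O^h W_V^h$ are the combined matrices of head $h$, $t$ is the matrix of one-hot input tokens, and $\mathrm{softmax}^*$ denotes softmax with causal masking.

```latex
T \;=\; \underbrace{\mathrm{Id} \otimes W_U W_E}_{\text{direct path (bigram-like)}}
\;+\; \sum_{h \in H} \underbrace{A^h \otimes \left( W_U W_{OV}^h W_E \right)}_{\text{one term per head (skip-trigram-like)}},
\qquad
A^h \;=\; \mathrm{softmax}^*\!\left( t^\top W_E^\top W_{QK}^h W_E \, t \right)
```

Under these assumptions, $W_U W_E$ is a vocabulary-by-vocabulary table for the direct path, while each head contributes two more such tables: $W_E^\top W_{QK}^h W_E$, which scores how strongly a destination token attends to a source token, and $W_U W_{OV}^h W_E$, which gives the effect of an attended source token on the output logits. All three can be computed from the weights without running the model.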
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
- The paper explicitly asks and addresses this question, concluding the answer depends on what 'fully understand' means
- A form of key-query attention within a single input sequence; core to Transformers.
- Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
- Core operation in transformers, computing weighted combinations of previous elements
- Transformer attention heads that could be recruited to extract different kinds of information (text vs. thoughts).
- Core abstraction in Fruit: pure function mapping signals to signals; enables compositional GUI definitions.
- Original transformer paper; foundational reference cited throughout for the architecture being analyzed.