question
active
question:do-we-fully-understand-one-layer-attention-only-transformers
Do we 'fully understand' one-layer attention-only transformers?
The paper explicitly poses this question and concludes that the answer depends on what 'fully understand' means.
Neighborhood — ranked by edge-count
Claims (1)
claim
- Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
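To illustrate what "accessed without running the model" can mean, here is a minimal sketch that builds token-by-token tables purely from weight matrices. The weights are random stand-ins, the matrix names and toy sizes are assumptions, and positional embeddings, layer norm, and attention scaling are ignored, so this shows the shape of the idea rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, d_head = 50, 16, 4  # toy sizes (assumed)

# Hypothetical random weights standing in for one trained attention head.
W_E = rng.normal(size=(d_model, n_vocab))  # token embedding
W_U = rng.normal(size=(n_vocab, d_model))  # unembedding
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

# QK table: entry [dst, src] is the pre-softmax attention score
# when the query token is `dst` and the key token is `src`.
qk_circuit = W_E.T @ W_Q.T @ W_K @ W_E  # (n_vocab, n_vocab)

# OV table: entry [out, src] is the change in the logit of `out`
# caused by attending to `src`.
ov_circuit = W_U @ W_O @ W_V @ W_E  # (n_vocab, n_vocab)


def strongest_skip_trigram(dst):
    """Return (src, out): the source token that `dst` scores highest in the
    QK table, and the output token whose logit that source raises the most."""
    src = int(np.argmax(qk_circuit[dst]))
    out = int(np.argmax(ov_circuit[:, src]))
    return src, out
```

Both tables are products of fixed matrices, so a lookup such as `strongest_skip_trigram(3)` needs no forward pass and no input text.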
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
- The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
- The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
- Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
- Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
- Result from term importance analysis breaking down loss contribution by layer
- Transformers almost surely maintain input-injectivity throughout training, not just at initialisation (hypothesis · 0.755): conjecture supported by Nikolaou et al. 2025 for last-token hidden states
- Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
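The copying observation in the list above rests on inspecting expanded OV matrices. A minimal sketch of one way to score copying follows, assuming it is summarized by the share of eigenvalue mass that is positive. The function name and test matrices are hypothetical, and the exact criterion behind the "10 out of 12" figure is not stated on this page.

```python
import numpy as np


def copying_score(expanded_ov):
    """Sum of eigenvalues divided by sum of their magnitudes.

    Close to +1 when eigenvalues are mostly positive (copy-like),
    close to -1 when they are mostly negative.
    """
    eigvals = np.linalg.eigvals(expanded_ov)
    return float(eigvals.sum().real / np.abs(eigvals).sum())


# Toy low-rank matrices (assumed data, not taken from any model).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4))
copy_like = A @ A.T   # positive semidefinite: raises the logit of the attended token
anti_copy = -copy_like  # negative semidefinite: suppresses it
```

Here `copying_score(copy_like)` comes out near +1 and `copying_score(anti_copy)` near -1; a real head would typically fall somewhere in between.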