Do we 'fully understand' one-layer attention-only transformers?

The paper explicitly poses this question and concludes that the answer depends on what 'fully understand' means.

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood (ranked by edge count)

Claims (1)

claim

One-layer attention-only transformers are an ensemble of bigram and skip-trigram models whose parameters can be read directly from weights
answered_by
Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
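As a rough illustration of "readable directly from weights", the sketch below multiplies out three token-to-token tables for a single head of a toy one-layer, attention-only model: the direct-path bigram table W_U W_E, the QK table W_E^T W_Q^T W_K W_E, and the OV table W_U W_O W_V W_E. It assumes no layer norm, no positional embeddings and small random weights; the helper names, sizes and the column-per-token layout of W_E are illustrative assumptions, not the paper's code.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must agree"
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def read_tables(W_E, W_U, W_Q, W_K, W_V, W_O):
    """Read the bigram and skip-trigram tables of one head straight from its
    weights; no forward pass over any input is run."""
    bigram = matmul(W_U, W_E)                       # [next token, current token]
    W_QK = matmul(transpose(W_Q), W_K)              # d_model x d_model
    qk = matmul(matmul(transpose(W_E), W_QK), W_E)  # [query token, key token]
    W_OV = matmul(W_O, W_V)                         # d_model x d_model
    ov = matmul(matmul(W_U, W_OV), W_E)             # [output token, attended token]
    return bigram, qk, ov

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def strongest_skip_trigram(qk, ov, src):
    """For attended token `src`, pick the query token that scores it highest and
    the output token its OV circuit boosts most: 'src ... dst -> out'."""
    dst = argmax([row[src] for row in qk])
    out = argmax([row[src] for row in ov])
    return src, dst, out

# Hypothetical toy sizes and random weights, purely for illustration.
rng = random.Random(0)
n_vocab, d_model, d_head = 6, 4, 2
W_E = rand_matrix(d_model, n_vocab, rng)   # embedding: one column per token
W_U = rand_matrix(n_vocab, d_model, rng)   # unembedding: residual stream -> logits
W_Q = rand_matrix(d_head, d_model, rng)
W_K = rand_matrix(d_head, d_model, rng)
W_V = rand_matrix(d_head, d_model, rng)
W_O = rand_matrix(d_model, d_head, rng)

bigram, qk, ov = read_tables(W_E, W_U, W_Q, W_K, W_V, W_O)
for token in range(n_vocab):
    print("skip-trigram (src, dst, out):", strongest_skip_trigram(qk, ov, token))
```

Choosing `dst` and `out` independently for each source token is a simplification; a fuller reading would inspect the joint pattern across the QK and OV tables.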

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.869
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
One-Layer Attention-Only Transformer · concept · 0.866
The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
Two-Layer Attention-Only Transformer · concept · 0.819
The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads · claim · 0.814
Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.786
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.760
Result from term importance analysis breaking down loss contribution by layer
Transformers almost surely maintain input-injectivity throughout training, not just at initialisation · hypothesis · 0.755
Conjecture supported by Nikolaou et al. 2025 for last-token hidden states
MLP layers are much harder to get traction on than attention layers; understanding them requires individually interpretable neurons, which are rarely found · claim · 0.754
Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
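The composition entries above say K-composition can be detected from weights alone. A minimal sketch of one way to quantify that, assuming a normalized Frobenius-norm score ||W_A W_B||_F / (||W_A||_F ||W_B||_F) with W_A the second-layer head's QK matrix and W_B the first-layer head's OV matrix. The two toy second-layer heads differ only in which residual-stream dimensions their keys read; all weights, sizes and helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))

def composition_score(W_A, W_B):
    """||W_A W_B||_F / (||W_A||_F ||W_B||_F); lies in [0, 1] by submultiplicativity."""
    return frobenius(matmul(W_A, W_B)) / (frobenius(W_A) * frobenius(W_B))

def masked_gaussian(rows, cols, rng, keep):
    """Gaussian random matrix with every entry where keep(r, c) is False set to 0."""
    return [[rng.gauss(0.0, 1.0) if keep(r, c) else 0.0 for c in range(cols)]
            for r in range(rows)]

def everything(r, c):
    return True

rng = random.Random(1)
d_model, d_head = 4, 2

# Hypothetical layer-1 head: reads the whole residual stream, writes only into dims 0-1.
W_V1 = masked_gaussian(d_head, d_model, rng, everything)
W_O1 = masked_gaussian(d_model, d_head, rng, lambda r, c: r < 2)
W_OV1 = matmul(W_O1, W_V1)

def qk_matrix(key_reads):
    """Hypothetical layer-2 head whose key side reads only the dims in key_reads."""
    W_Q = masked_gaussian(d_head, d_model, rng, everything)
    W_K = masked_gaussian(d_head, d_model, rng, lambda r, c: c in key_reads)
    return matmul(transpose(W_Q), W_K)  # W_QK = W_Q^T W_K

# K-composition: the layer-2 key is built from what layer 1 wrote, so score W_QK^{h2} W_OV^{h1}.
k_aligned = composition_score(qk_matrix({0, 1}), W_OV1)
k_orthogonal = composition_score(qk_matrix({2, 3}), W_OV1)
print(f"K-composition, key reads dims 0-1: {k_aligned:.3f}")
print(f"K-composition, key reads dims 2-3: {k_orthogonal:.3f}")
```

The score is exactly 0 when the second-layer key ignores the subspace the first-layer head writes to, and grows as the two overlap; in a trained model one would compare such scores against those of random matrices of the same shape rather than read them in absolute terms.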
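The 2/3 figure in the MLP entry can be checked with a back-of-the-envelope count per layer. This assumes the common conventions d_ff = 4 d_model and n_heads · d_head = d_model, and ignores biases, layer norms, and the embedding and unembedding matrices; those assumptions are ours, not stated in the entry.

```latex
\begin{aligned}
P_{\text{attn}} &= \underbrace{4\,d_{\text{model}}^{2}}_{W_Q,\,W_K,\,W_V,\,W_O}, \qquad
P_{\text{MLP}} = \underbrace{2\,d_{\text{model}}\,d_{\text{ff}}}_{W_{\text{in}},\,W_{\text{out}}} = 8\,d_{\text{model}}^{2} \\
\frac{P_{\text{MLP}}}{P_{\text{attn}} + P_{\text{MLP}}} &= \frac{8}{4 + 8} = \frac{2}{3}
\end{aligned}
```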