claim
active
claim:two-layer-attention-only-transformers-implement-much-more-complex-algorithms-via-composition-of-attention-heads-detectable-directly-from-weights

Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights

Core claim for two-layer models: composing attention heads across layers yields qualitatively more powerful in-context learning than one-layer models exhibit
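One way to read "detectable directly from weights" is a composition score between a layer-1 head and a layer-2 head, computed as a ratio of Frobenius norms of their folded QK and OV matrices. The sketch below is a minimal illustration of that idea, not the source paper's code. The helper names (head_circuits, composition_score, composition_scores), the weight shapes, and the random example weights are all assumptions made for the example.

```python
import numpy as np


def head_circuits(w_q, w_k, w_v, w_o):
    """Fold one head's weights into its QK and OV circuits.

    Assumed shapes (hypothetical convention for this sketch):
    w_q, w_k, w_v are (d_head, d_model); w_o is (d_model, d_head).
    """
    w_qk = w_q.T @ w_k  # (d_model, d_model): score = x_dst^T @ w_qk @ x_src
    w_ov = w_o @ w_v    # (d_model, d_model): what the head writes to the residual stream
    return w_qk, w_ov


def composition_score(later, earlier_ov):
    """Return ||later @ earlier_ov||_F / (||later||_F * ||earlier_ov||_F), in [0, 1]."""
    numerator = np.linalg.norm(later @ earlier_ov)  # Frobenius norm for 2-D input
    denominator = np.linalg.norm(later) * np.linalg.norm(earlier_ov)
    return float(numerator / denominator)


def composition_scores(layer1_head, layer2_head):
    """Q-, K-, and V-composition of a layer-2 head with a layer-1 head."""
    _, w_ov1 = head_circuits(*layer1_head)
    w_qk2, w_ov2 = head_circuits(*layer2_head)
    return {
        "Q": composition_score(w_qk2.T, w_ov1),  # layer-1 output feeds the query side
        "K": composition_score(w_qk2, w_ov1),    # layer-1 output feeds the key side
        "V": composition_score(w_ov2, w_ov1),    # layer-1 output feeds the value side
    }


def random_head(rng, d_model=16, d_head=4):
    """Random stand-in weights; a real analysis would load trained weights."""
    return (
        rng.normal(size=(d_head, d_model)),
        rng.normal(size=(d_head, d_model)),
        rng.normal(size=(d_head, d_model)),
        rng.normal(size=(d_model, d_head)),
    )


rng = np.random.default_rng(0)
scores = composition_scores(random_head(rng), random_head(rng))
print(scores)
```

Random weights only give a baseline level of composition. On a trained model, a pair of heads scoring well above that baseline would be a candidate for a composed circuit, which still needs to be confirmed by inspecting the heads' behavior.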

Source paper

extracted_from
A Mathematical Framework for Transformer Circuits (2021)

