claim

active

claim:second-order-virtual-attention-head-terms-v-composition-have-a-small-marginal-effect-in-two-layer-attention-only-models

Second-order virtual attention head terms (V-composition) have a small marginal effect in two-layer attention-only models

Finding from term importance analysis; allows focus on individual head terms rather than their compositions
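To make "second-order virtual attention head term" concrete, here is a minimal sketch, not the paper's code. It assumes frozen attention patterns and uses random stand-in matrices. The names `A1`, `A2`, `W_OV1`, `W_OV2`, and `X` are hypothetical. It checks that passing the residual stream through a first-layer head's OV circuit and then a second-layer head's OV circuit (V-composition) is equivalent to a single virtual head with attention pattern `A2 @ A1` and OV matrix `W_OV2 @ W_OV1`.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def causal_attention(rng, n):
    """Random causal attention pattern: lower-triangular, each row sums to 1."""
    pattern = []
    for i in range(n):
        weights = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(i + 1)]
        total = sum(weights)
        pattern.append([w / total for w in weights] + [0.0] * (n - i - 1))
    return pattern


rng = random.Random(0)
n_tokens, d_model = 5, 8  # toy sizes, chosen only for illustration

A1 = causal_attention(rng, n_tokens)          # first-layer head pattern (frozen)
A2 = causal_attention(rng, n_tokens)          # second-layer head pattern (frozen)
W_OV1 = random_matrix(rng, d_model, d_model)  # stand-in for a layer-1 OV matrix
W_OV2 = random_matrix(rng, d_model, d_model)  # stand-in for a layer-2 OV matrix
X = random_matrix(rng, n_tokens, d_model)     # residual stream, one row per token

# Sequential path: head 1 writes to the residual stream, head 2 reads that output.
head1_out = matmul(matmul(A1, X), transpose(W_OV1))
sequential = matmul(matmul(A2, head1_out), transpose(W_OV2))

# Virtual head: compose the attention patterns and the OV matrices first.
A_virtual = matmul(A2, A1)
W_OV_virtual = matmul(W_OV2, W_OV1)
virtual = matmul(matmul(A_virtual, X), transpose(W_OV_virtual))

max_abs_diff = max(
    abs(s - v) for row_s, row_v in zip(sequential, virtual) for s, v in zip(row_s, row_v)
)
```

The claim above concerns how much terms of this form reduce the loss in the analyzed model. The sketch only shows what such a term is and does not reproduce that measurement.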

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Findings (1)

finding

Second-order virtual attention head terms contribute negligible marginal loss reduction in the analyzed two-layer attention-only model
restates · supports
Result of term importance analysis ablation experiment; justifies focusing on individual head terms

Hypotheses (1)

hypothesis

Virtual attention heads (V-composition) may be much more important in larger and more complex transformers than in two-layer toy models
supports
Forward-looking speculation based on the theoretical elegance and combinatorial growth of virtual head count with depth

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In the analyzed two-layer attention-only model, only K-composition is significant; V- and Q-composition are negligible by Frobenius norm measure · finding · 0.846
Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.831
Result from term importance analysis breaking down loss contribution by layer
Naive interpretation of attention patterns can be both informative and fundamentally misleading when Q-, K-, or V-composition is present · claim · 0.814
Response to the 'attention as explanation' critique; the paper provides a typology of when attention is and isn't directly interpretable
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.780
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
In small two-layer attention-only transformers, the only significant composition is K-composition between a single first-layer head and some second-layer heads · claim · 0.771
Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
Each attention head has two largely independent computations: a QK circuit computing the attention pattern and an OV circuit computing the effect if attended to · claim · 0.770
Key decomposition enabling separate analysis of where attention goes and what it does
Directing response attention to complement syntax and/or mental state verbs (MSV) yields no significant alterations in IIT estimates compared to entire stimulus analysis. · finding · 0.768
Suggests LLMs do not represent complement/MSV linguistic features in a way that mirrors their crucial role in human ToM development.
Causally-masked attention in a decoder-only model has no ordered phase (Proposition 2) · finding · 0.766
Application to transformer language models

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
Second-order virtual attention head terms contribute negligible marginal loss reduction in the analyzed two-layer attention-only model