claim
active
claim:second-order-virtual-attention-head-terms-v-composition-have-a-small-marginal-effect-in-two-layer-attention-only-models
Second-order virtual attention head terms (V-composition) have a small marginal effect in two-layer attention-only models
Finding from term importance analysis; allows focus on individual head terms rather than their compositions
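For orientation, the following is a minimal sketch of one common way to quantify how strongly two heads V-compose: the Frobenius norm of the product of their OV matrices, normalized by the product of the individual norms. It is an illustration under stated assumptions, not the analysis behind this claim: the matrices are random stand-ins rather than trained weights, and the helper names (`matmul`, `frobenius`, `v_composition_score`, `random_low_rank`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def frobenius(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def v_composition_score(w_ov_2, w_ov_1):
    """Normalized norm of the composed OV map of a layer-2 and a layer-1 head.

    Computes ||W_OV2 W_OV1||_F / (||W_OV2||_F * ||W_OV1||_F), which lies
    in [0, 1] because the Frobenius norm is submultiplicative.
    """
    composed = matmul(w_ov_2, w_ov_1)
    return frobenius(composed) / (frobenius(w_ov_2) * frobenius(w_ov_1))


def random_low_rank(d_model, d_head, rng):
    """Random stand-in for W_OV = W_O W_V (assumption: not trained weights)."""
    w_o = [[rng.gauss(0, 1) for _ in range(d_head)] for _ in range(d_model)]
    w_v = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
    return matmul(w_o, w_v)


rng = random.Random(0)
w_ov_1 = random_low_rank(16, 4, rng)  # hypothetical layer-1 head
w_ov_2 = random_low_rank(16, 4, rng)  # hypothetical layer-2 head
score = v_composition_score(w_ov_2, w_ov_1)
print(f"V-composition score: {score:.3f}")
```

A score near the value expected for unrelated random matrices suggests the second head reads little of what the first head writes, which is consistent with the second-order term contributing little.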
Neighborhood — ranked by edge-count
Findings (1)
finding
- Second-order virtual attention head terms contribute negligible marginal loss reduction in the analyzed two-layer attention-only model [restates, supports]: Result of term importance analysis ablation experiment; justifies focusing on individual head terms
Hypotheses (1)
hypothesis
- Forward-looking speculation based on the theoretical elegance and combinatorial growth of virtual head count with depth
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
- Result from term importance analysis breaking down loss contribution by layer
- Response to the 'attention as explanation' critique; the paper provides a typology of when attention is and isn't directly interpretable
- Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
- Empirical observation from the specific two-layer model analyzed; no significant V- or Q-composition found
- Key decomposition enabling separate analysis of where attention goes and what it does
- Suggests that LLMs do not represent complement/MSV linguistic features in the same way as humans, for whom these features are crucial to ToM development.
- Application to transformer language models
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.