finding
active
finding:second-order-virtual-attention-head-terms-contribute-negligible-marginal-loss-reduction-in-the-analyzed-two-layer-attention-only-model
Second-order virtual attention head terms contribute negligible marginal loss reduction in the analyzed two-layer attention-only model
Result of the ablation experiment in the term importance analysis; justifies focusing on individual head terms
Neighborhood — ranked by edge-count
Claims (1)
claim
- Second-order virtual attention head terms (V-composition) have a small marginal effect in two-layer attention-only models · restates · supports · Finding from term importance analysis; allows focus on individual head terms rather than their compositions
Hypotheses (1)
hypothesis
- Forward-looking speculation based on the theoretical elegance and combinatorial growth of virtual head count with depth
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Result from term importance analysis breaking down loss contribution by layer
- Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
- Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
- Response to the 'attention as explanation' critique; the paper provides a typology of when attention is and isn't directly interpretable
- Key decomposition enabling separate analysis of where attention goes and what it does
- Mathematical equivalence enabling independent analysis of each attention head
- Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
- Predictive hypothesis driving the investigation in Section 3.3; supported by experimental evidence.
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.