finding

active

finding:in-the-analyzed-two-layer-model-second-layer-attention-head-terms-dominate-the-loss-reduction-compared-to-first-layer-terms-and-the-direct-path

In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path

Result from term importance analysis breaking down loss contribution by layer

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Second-order virtual attention head terms contribute negligible marginal loss reduction in the analyzed two-layer attention-only model · finding · 0.893
Result of term importance analysis ablation experiment; justifies focusing on individual head terms
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.832
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
Second-order virtual attention head terms (V-composition) have a small marginal effect in two-layer attention-only models · claim · 0.831
Finding from term importance analysis; allows focus on individual head terms rather than their compositions
In the analyzed two-layer attention-only model, only K-composition is significant; V- and Q-composition are negligible by Frobenius norm measure · finding · 0.808
Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance) · finding · 0.793
Striking mechanistic finding that injection creates a universally detectable perturbation in the residual stream immediately downstream
One-layer model attention heads encode Python-specific skip-trigrams including indentation-based elif/else prediction and function signature patterns · finding · 0.787
Concrete example from examining expanded QK/OV matrices showing how specific programming language structure is encoded in attention weights
Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.785
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
Attention heads can be understood as independent operations each adding their output to the residual stream, equivalent to the concatenate-and-multiply formulation · claim · 0.784
Mathematical equivalence enabling independent analysis of each attention head