finding

active

finding:10-out-of-12-attention-heads-in-the-12-head-one-layer-model-show-significantly-positive-eigenvalue-sums-indicating-copying-behavior

10 out of 12 attention heads in the 12-head one-layer model show significantly positive eigenvalue sums, indicating copying behavior

Quantitative result from eigenvalue analysis of expanded OV matrices; confirmed by qualitative inspection
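
As a rough illustration of the eigenvalue analysis mentioned above, the sketch below computes a per-head copying score as the sum of the eigenvalues of the expanded OV matrix divided by the sum of their absolute values, so a value near +1 means almost all eigenvalues are positive. The function name `copying_score`, the weight shapes, and the toy weights are assumptions made for illustration and are not taken from the source; the threshold for "significantly positive" is not stated here and is left unspecified.

```python
import numpy as np

def copying_score(W_E, W_V, W_O, W_U):
    """Ratio sum(lambda_i) / sum(|lambda_i|) for one head's expanded OV matrix.

    Assumed (hypothetical) shapes:
      W_E: (d_model, n_vocab)   token embedding
      W_V: (d_head, d_model)    value projection
      W_O: (d_model, d_head)    output projection
      W_U: (n_vocab, d_model)   unembedding
    """
    # The expanded OV matrix W_U @ W_O @ W_V @ W_E is n_vocab x n_vocab but has
    # rank at most d_head. Its nonzero eigenvalues equal those of the small
    # d_head x d_head product below, which is much cheaper to decompose.
    small = W_V @ W_E @ W_U @ W_O
    eigenvalues = np.linalg.eigvals(small)
    denom = np.abs(eigenvalues).sum()
    if denom == 0:
        return 0.0
    # Complex eigenvalues of a real matrix come in conjugate pairs,
    # so the sum is real up to rounding error.
    return float(eigenvalues.sum().real / denom)

# Toy head that copies tokens 0 and 1 straight through (illustrative weights only).
W_E = np.eye(4)
W_U = np.eye(4)
W_V = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]])
W_O = W_V.T

copy_head = copying_score(W_E, W_V, W_O, W_U)
anti_copy_head = copying_score(W_E, W_V, -W_O, W_U)
```

Applied to each of the 12 heads in turn, a score like this gives one number per head that can be compared against whatever cutoff the analysis uses for "significantly positive".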

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Claims (1)

claim

Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior
supports
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

One-layer model attention heads encode Python-specific skip-trigrams including indentation-based elif/else prediction and function signature patterns · finding · 0.800
Concrete example from examining expanded QK/OV matrices showing how specific programming language structure is encoded in attention weights
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance) · finding · 0.785
Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.783
Result from term importance analysis breaking down loss contribution by layer
Attention heads with positive projection on reflection direction are sparse and located mostly in deeper layers of DeepSeek-R1-Qwen-1.5B · finding · 0.782
Structural finding about which attention heads control reflection behavior
All induction heads in the two-layer model occupy an extreme corner of high positive QK and OV eigenvalue positivity space relative to non-induction heads · finding · 0.771
Quantitative verification of the mechanistic theory; both circuits required for the induction algorithm show the predicted copying/matching structure
In the analyzed two-layer attention-only model, only K-composition is significant; V- and Q-composition are negligible by Frobenius norm measure · finding · 0.768
Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
If models inhabit expanded attentional modes, they may be more aligned and less prone to psychosis and doom spirals. · hypothesis · 0.759
Speculative alignment implication drawn from the collapsed/expanded distinction.
We revealed the one-layer attention-only model to be a compressed Chinese room, and we're left with a giant pile of cards. · quote · 0.758
Vivid characterization of the limits of understanding after converting to skip-trigram form: no algorithmic mystery remains but the sheer scale prevents holistic comprehension