finding

active

finding:all-32-attention-heads-at-layer-3-achieve-100-localization-accuracy-for-injections-at-layer-2-5-way-classification-20-chance

All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance)

A striking mechanistic finding: injection creates a universally detectable perturbation in the residual stream immediately downstream
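To put the 100% figure in context, the sketch below computes the uniform-guess baseline for a k-way localization task and scores each head's predictions against it. It is only an illustration: this card does not describe how the paper reads out a head's prediction, so `chance_accuracy`, `per_head_accuracy`, and the toy predictions are all hypothetical.

```python
from typing import List, Sequence


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing in a k-way task."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    return 1.0 / num_classes


def per_head_accuracy(
    predictions: Sequence[Sequence[int]], labels: Sequence[int]
) -> List[float]:
    """Fraction of trials each head localizes correctly.

    predictions[h][i] is head h's predicted injection position on trial i
    (hypothetical layout; the real readout is not specified in this card).
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    return [
        sum(p == y for p, y in zip(head_preds, labels)) / len(labels)
        for head_preds in predictions
    ]


# Toy data, NOT results from the paper: 2 heads, 5 trials, 5 positions.
labels = [0, 1, 2, 3, 4]
toy_predictions = [
    [0, 1, 2, 3, 4],  # a head that is right on every trial
    [0, 1, 2, 3, 0],  # a head that misses the last trial
]

baseline = chance_accuracy(5)  # 5-way task, as in this finding
accuracies = per_head_accuracy(toy_predictions, labels)
heads_above_chance = [acc > baseline for acc in accuracies]
```

With five candidate positions the baseline is 1/5, matching the 20% chance level quoted in the finding. The same helper gives the 10% and 50% baselines quoted for the 10-way and pairwise tasks among the related findings.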

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (3)

claim

Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter
supports
Key quantitative characterization of the layer-dependence of partial introspection
Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits
supports
Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
Introspection relies on general-purpose computational mechanisms—attention-based anomaly detection and residual stream dynamics—rather than specialized introspection circuits
supports
Interpretive claim about the mechanistic substrate of introspection in LLMs

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Layer 27 (last layer) has largest projection magnitude on the reflection direction among all attention head layers in DeepSeek-R1-Qwen-1.5B · finding · 0.806
Attribution finding suggesting the last layer directly controls reflection keyword generation
Attention heads with positive projection on reflection direction are sparse and located mostly in deeper layers of DeepSeek-R1-Qwen-1.5B · finding · 0.802
Structural finding about which attention heads control reflection behavior
Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification · finding · 0.802
Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.798
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
In the analyzed two-layer model, second-layer attention head terms dominate the loss reduction compared to first-layer terms and the direct path · finding · 0.793
Result from term importance analysis breaking down loss contribution by layer
The case at approximately the 2/3 layer of LLaMA3.1-8B (Layer 24, satisfying Criteria 1 and 2) aligns with prior studies showing the 2/3 layer optimally predicts human brain activity · finding · 0.789
Connects this study's results to Schrimpf et al. 2021 and Caucheteux et al. 2022/2023 findings on brain-LLM alignment.
10 out of 12 attention heads in the 12-head one-layer model show significantly positive eigenvalue sums, indicating copying behavior · finding · 0.785
Quantitative result from eigenvalue analysis of expanded OV matrices; confirmed by qualitative inspection
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.783
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude