finding
active
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance)
Striking mechanistic finding: an injection creates a perturbation in the residual stream that every attention head immediately downstream can detect
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood — ranked by edge-count
Claims (3)
claim
- Key quantitative characterization of the layer-dependence of partial introspection
- Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
- Interpretive claim about the mechanistic substrate of introspection in LLMs
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Attribution finding suggesting the last layer directly controls reflection keyword generation
- Structural finding about which attention heads control reflection behavior
- Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification · finding · 0.802 · Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
- Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
- Result from term importance analysis breaking down loss contribution by layer
- Connects this study's results to Schrimpf et al. 2021 and Caucheteux et al. 2022/2023 findings on brain-LLM alignment.
- Quantitative result from eigenvalue analysis of expanded OV matrices; confirmed by qualitative inspection
- Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.783 · Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude