finding
active
finding:all-32-attention-heads-at-layer-3-achieve-100-localization-accuracy-for-injections-at-layer-2-5-way-classification-20-chance

All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance)

A striking mechanistic finding: the injection creates a perturbation in the residual stream that every attention head in the immediately downstream layer can detect.
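To make the numbers in the finding concrete, the sketch below shows how a per-head localization accuracy can be compared against the chance level for a 5-way task (1/5 = 20%). It is illustrative only: the function names, the toy labels, and the three-head example are hypothetical, and this record does not state what the five classes are or how the paper computed its accuracies.

```python
from typing import Sequence


def chance_accuracy(num_classes: int) -> float:
    """Accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes


def localization_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of trials where the predicted class matches the true class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def all_heads_at_ceiling(
    per_head_predictions: Sequence[Sequence[int]], actual: Sequence[int]
) -> bool:
    """True if every head's predictions are correct on every trial."""
    return all(
        localization_accuracy(preds, actual) == 1.0
        for preds in per_head_predictions
    )


# Hypothetical toy labels (NOT data from the paper): five classes, three heads.
actual = [0, 1, 2, 3, 4]
per_head_predictions = [list(actual) for _ in range(3)]

print(chance_accuracy(5))
print(all_heads_at_ceiling(per_head_predictions, actual))
```

Under this framing, the finding corresponds to `all_heads_at_ceiling` holding across all 32 heads at layer 3, against a chance baseline of `chance_accuracy(5)`.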

Source paper

extracted_from
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
