finding
active
finding:cosine-similarity-between-perturbed-and-baseline-residual-streams-returns-toward-1-0-and-projection-onto-injection-direction-decays-exponentially-over-subsequent-layers
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers
Mechanistic evidence that the network actively attenuates injected perturbations, which explains the late-layer introspection failure
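The two quantities named in this finding can be illustrated with a minimal sketch. It assumes per-layer residual-stream vectors are available as plain lists; the function names (`layerwise_metrics`, `fit_decay_rate`) and the synthetic data are hypothetical and are not taken from the source paper. The sketch computes, per layer, the cosine similarity between perturbed and baseline vectors and the projection of their difference onto the injection direction, then estimates an exponential decay rate with a log-linear fit.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def projection_onto(delta, direction):
    """Scalar projection of delta onto the unit-normalized direction."""
    return dot(delta, direction) / norm(direction)

def layerwise_metrics(baseline, perturbed, direction):
    """Per-layer (cosine similarity, projection of the difference onto direction)."""
    metrics = []
    for h_base, h_pert in zip(baseline, perturbed):
        delta = [p - b for p, b in zip(h_pert, h_base)]
        metrics.append((cosine_similarity(h_pert, h_base),
                        projection_onto(delta, direction)))
    return metrics

def fit_decay_rate(projections):
    """Estimate lam in proj_k ~ A * exp(-lam * k) via least squares on log(proj_k).

    Assumes all projections are strictly positive.
    """
    xs = list(range(len(projections)))
    ys = [math.log(p) for p in projections]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return -slope

# Synthetic toy data (an assumption for illustration, not results from the paper):
# a perturbation of size alpha along `direction` that shrinks by exp(-lam) per layer.
direction = [1.0, 0.0, 0.0]
baseline = [[0.5, 2.0, 1.0] for _ in range(6)]
alpha, lam = 4.0, 0.5
perturbed = [[b[0] + alpha * math.exp(-lam * k), b[1], b[2]]
             for k, b in enumerate(baseline)]

metrics = layerwise_metrics(baseline, perturbed, direction)
cosines = [c for c, _ in metrics]
projections = [p for _, p in metrics]
decay_rate = fit_decay_rate(projections)
```

On this toy data the cosine similarity rises toward 1.0 with layer offset while the projection shrinks, which is the qualitative pattern the finding describes. With real activations the projections may be noisy or negative, so the log-linear fit would need guarding.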
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Appendix E replication of DIM alignment finding in Qwen model
- Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
- Core result of Experiment 3: cross-model semantic convergence under self-referential processing
- Proposed future application of the Assistant Axis
- Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
- Feature extraction method computing cosine similarity of hidden representations with reflection direction across all layers
- Mechanistic explanation for discrepancy with Banayeeanzade et al.; addressed by centroid unit and unbounded sweep contributions
- Demonstrates emotion-specific persistence beyond variance effects in Cogito