finding

active

finding:cosine-similarity-between-perturbed-and-baseline-residual-streams-returns-toward-1-0-and-projection-onto-injection-direction-decays-exponentially-over-subsequent-layers

Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers

Mechanistic evidence that the network actively attenuates injected perturbations, which explains why late-layer introspection fails
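The two quantities named in this finding can be illustrated with a minimal sketch. It is not the paper's code: the function names, array shapes, and synthetic activations below are assumptions made for illustration only. The sketch computes the per-layer cosine similarity between baseline and perturbed residual streams, the per-layer projection of their difference onto the injection direction, and a log-linear fit of the decay rate.

```python
import numpy as np


def cosine_per_layer(baseline, perturbed):
    """Cosine similarity per layer for two (n_layers, d_model) activation stacks."""
    num = np.sum(baseline * perturbed, axis=1)
    den = np.linalg.norm(baseline, axis=1) * np.linalg.norm(perturbed, axis=1)
    return num / den


def projection_per_layer(baseline, perturbed, direction):
    """Projection of (perturbed - baseline) onto the unit injection direction."""
    unit = direction / np.linalg.norm(direction)
    return (perturbed - baseline) @ unit


def fit_decay_rate(projections):
    """Fit |p_k| ~ p_0 * exp(-rate * k) by least squares on log |p_k|."""
    k = np.arange(len(projections))
    slope, _ = np.polyfit(k, np.log(np.abs(projections)), 1)
    return -slope


# Synthetic stand-in for real activations (assumption: index 0 is the injection layer).
rng = np.random.default_rng(0)
n_layers, d_model = 12, 64
baseline = rng.normal(size=(n_layers, d_model))
direction = rng.normal(size=d_model)
unit = direction / np.linalg.norm(direction)

# Hypothetical perturbation whose strength decays exponentially with depth.
true_rate = 0.5
strength = 8.0 * np.exp(-true_rate * np.arange(n_layers))
perturbed = baseline + strength[:, None] * unit

cos = cosine_per_layer(baseline, perturbed)
proj = projection_per_layer(baseline, perturbed, direction)
rate = fit_decay_rate(proj)
```

With real data, `baseline` and `perturbed` would be residual-stream activations captured at the same token position with and without the injection. A cosine curve that rises toward 1.0 together with a positive fitted `rate` would match the pattern this finding describes.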

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits
supports
Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.791
Appendix E replication of DIM alignment finding in Qwen model
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes. · claim · 0.784
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.780
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment: a quantitative signal for when models are drifting from their intended identity · claim · 0.779
Proposed future application of the Assistant Axis
In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9) · finding · 0.772
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
Cosine projection on reflection direction · method · 0.763
Feature extraction method computing cosine similarity of hidden representations with reflection direction across all layers
Uncalibrated sweep units and restricted coefficient ranges are the primary cause of prior reports showing P2 outperforming MD injections · claim · 0.758
Mechanistic explanation for discrepancy with Banayeeanzade et al.; addressed by centroid unit and unbounded sweep contributions
In Cogito v2.1, average residual persistence above variance-matched probes is +0.077 (p = 1.5e-27, 157 of 171 probes positive). · finding · 0.758
Demonstrates emotion-specific persistence beyond variance effects in Cogito