finding
active

Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers

Mechanistic evidence that the network actively attenuates injected perturbations, which would explain why introspection fails in late layers
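The two quantities named in this finding can be illustrated with a minimal sketch. It is not the paper's code: the function names, the array shapes, and the synthetic data (a perturbation whose strength halves at each layer) are assumptions made for illustration only, and the projection is taken here on the difference between the perturbed and baseline streams, which may differ from the authors' exact definition.

```python
import numpy as np

def layerwise_metrics(baseline, perturbed, direction):
    """Per-layer cosine similarity and projection onto the injection direction.

    baseline, perturbed: arrays of shape (n_layers, d_model), one residual
    vector per layer (hypothetical layout). direction: shape (d_model,).
    """
    unit = direction / np.linalg.norm(direction)
    dots = np.sum(baseline * perturbed, axis=1)
    norms = np.linalg.norm(baseline, axis=1) * np.linalg.norm(perturbed, axis=1)
    cosine = dots / norms
    # Assumption: measure how much of the injected offset remains at each layer.
    projection = (perturbed - baseline) @ unit
    return cosine, projection

def fit_decay_rate(projection):
    """Slope of log|projection| against layer index; a negative slope means decay."""
    layers = np.arange(len(projection))
    slope, _ = np.polyfit(layers, np.log(np.abs(projection)), 1)
    return slope

# Synthetic stand-in for real activations: the injected offset halves per layer.
rng = np.random.default_rng(0)
n_layers, d_model = 8, 64
baseline = rng.normal(size=(n_layers, d_model))
direction = rng.normal(size=d_model)
unit = direction / np.linalg.norm(direction)
strength = 4.0 * 0.5 ** np.arange(n_layers)
perturbed = baseline + strength[:, None] * unit

cosine, projection = layerwise_metrics(baseline, perturbed, direction)
decay_slope = fit_decay_rate(projection)
```

On real activations, a roughly linear fall of log|projection| with layer index, together with cosine similarity climbing back toward 1.0, would match the pattern this finding describes.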

Source paper

extracted_from
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
