claim

active

claim:late-layer-injection-fails-both-because-there-is-insufficient-computational-depth-for-integration-and-because-residual-recovery-dynamics-attenuate-the-perturbation-before-it-influences-output-logits

Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits

Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Findings (3)

finding

All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance)
supports
Striking mechanistic finding that injection creates a universally detectable perturbation in the residual stream immediately downstream
Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6
supports
Shows that signal integration into explicit prediction has barely begun immediately after injection
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers
supports
Mechanistic evidence that the network actively attenuates injected perturbations, explaining late-layer introspection failure
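The two recovery measurements named in the last finding (cosine similarity between perturbed and baseline residual streams, and projection onto the injection direction) can be sketched as follows. This is a minimal illustration on toy vectors, not the source paper's code: the 4-dimensional `baseline`, the unit `direction`, the per-layer retention factor `decay = 0.5`, and the helper names are all assumptions chosen for readability. Only `alpha = 6.0` echoes the α=6 setting mentioned above.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Cosine of the angle between two residual-stream vectors."""
    return dot(u, v) / (norm(u) * norm(v))

def projection_onto(delta, direction):
    """Scalar projection of a perturbation onto the (normalized) injection direction."""
    return dot(delta, direction) / norm(direction)

# Toy stand-ins (assumptions, not values from the paper).
baseline = [1.0, 2.0, -1.0, 0.5]    # hypothetical unperturbed residual vector
direction = [0.0, 1.0, 0.0, 0.0]    # hypothetical injection direction
alpha = 6.0                         # injection strength
decay = 0.5                         # assumed fraction of the perturbation kept per layer

def perturbed_at(layers_after):
    """Toy model: the injected component shrinks geometrically with depth."""
    scale = alpha * decay ** layers_after
    return [b + scale * d for b, d in zip(baseline, direction)]

def recovery_profile(num_layers):
    """Return (layers_after_injection, cosine_to_baseline, projection) per layer."""
    rows = []
    for k in range(num_layers):
        p = perturbed_at(k)
        delta = [x - b for x, b in zip(p, baseline)]
        rows.append((k, cosine_similarity(p, baseline), projection_onto(delta, direction)))
    return rows

for k, cos, proj in recovery_profile(6):
    print(f"layer +{k}: cosine={cos:.4f}  projection={proj:.4f}")
```

In this toy setup the projection halves at every layer while the cosine similarity climbs back toward 1.0, which is the qualitative pattern the finding describes. With a real model, `baseline` and the perturbed vectors would come from cached activations of clean and injected forward passes.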

Concepts (1)

concept

residual stream recovery dynamics
supports
The network's tendency to actively attenuate injected perturbations over subsequent layers, erasing the signal before output
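One way to quantify how quickly the network attenuates an injected perturbation is to fit an exponential to the per-layer projection magnitudes. The sketch below does this with an ordinary least-squares line through the log-magnitudes. It is illustrative only: `fit_exponential_decay` and `half_life` are hypothetical helper names, and the input series is synthetic rather than measured.

```python
import math

def fit_exponential_decay(projections):
    """Fit |p_k| ~ p0 * exp(-rate * k) by least squares on log|p_k|.

    Returns (p0, rate). Assumes every projection is nonzero.
    """
    ks = list(range(len(projections)))
    logs = [math.log(abs(p)) for p in projections]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_log = sum(logs) / n
    cov = sum((k - mean_k) * (y - mean_log) for k, y in zip(ks, logs))
    var = sum((k - mean_k) ** 2 for k in ks)
    slope = cov / var
    intercept = mean_log - slope * mean_k
    return math.exp(intercept), -slope

def half_life(rate):
    """Number of layers over which the projection halves."""
    return math.log(2) / rate

# Synthetic series (assumption): projection halves at each layer, starting from 6.
synthetic = [6.0 * 0.5 ** k for k in range(6)]
p0, rate = fit_exponential_decay(synthetic)
print(f"p0={p0:.3f}  rate={rate:.3f}  half-life={half_life(rate):.3f} layers")
```

A short half-life relative to the number of layers remaining before the output would be consistent with the perturbation being erased before it can influence the logits. A log-linear fit is a deliberately simple choice here, and it breaks down once projections reach zero or change sign.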

Claims (1)

claim

Signal integration from early perturbation into an explicit prediction requires substantial downstream computation spanning layers 4-20
supports
Mechanistic characterization based on logit lens analysis showing gradual accuracy rise across layers

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize earlier-layer interventions allow more downstream computation to process and potentially correct the perturbation · hypothesis · 0.744
Post-hoc explanation for why steering at layer 33 rather than layer 50 produced better ESR behavior in Llama-3.3-70B
The middle layer residual stream features are causally implicated in multi-step reasoning. · claim · 0.737
Features for Kobe Bryant, California, Lakers participate in computing the capital answer.
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes. · claim · 0.729
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
Even when the harness is loaded, weak-tier models fail to adhere to it due to weak instruction-following over long-horizon tasks, drifting more than four times as steeply as strong models · claim · 0.715
Diagnosis of second failure mode explaining low harness-benefit for weak-tier models
Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics · claim · 0.712
Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
Single-layer analyses can be misleading because early-layer truth directions may reflect surface features with limited cross-task generalization. · claim · 0.709
Methodological critique of prior work that fixed a single layer for truth probing.
In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth · claim · 0.704
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
Performance is best when the intervention skips both the first six and the last six layers · claim · 0.703
Empirical configuration finding from ablation study on layer selection