claim
active
claim:late-layer-injection-fails-both-because-there-is-insufficient-computational-depth-for-integration-and-because-residual-recovery-dynamics-attenuate-the-perturbation-before-it-influences-output-logits
Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits
Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood — ranked by edge-count
Findings (3)
finding
- Striking mechanistic finding that injection creates a universally detectable perturbation in the residual stream immediately downstream
- Shows that signal integration into explicit prediction has barely begun immediately after injection
- Mechanistic evidence that the network actively attenuates injected perturbations, explaining late-layer introspection failure
Concepts (1)
concept
- The network's tendency to actively attenuate injected perturbations over subsequent layers, erasing the signal before output
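One way to make the attenuation concept concrete is to track how large the injected-minus-clean difference in the residual stream is at each layer. The sketch below is illustrative only: `perturbation_profile` is a hypothetical helper, the activations are synthetic, and the halving decay is put in by hand to show what an attenuating profile would look like. None of it is data or code from the source paper.

```python
import numpy as np

def perturbation_profile(clean, injected):
    """L2 norm of (injected - clean) at each layer.

    clean, injected: arrays of shape (n_layers, d_model), standing in for
    residual-stream activations at one token position (hypothetical inputs).
    """
    clean = np.asarray(clean, dtype=float)
    injected = np.asarray(injected, dtype=float)
    return np.linalg.norm(injected - clean, axis=-1)

# Synthetic illustration: the decay below is an assumption, not a measurement.
rng = np.random.default_rng(0)
n_layers, d_model, inject_at = 8, 16, 5
clean = rng.normal(size=(n_layers, d_model))
direction = rng.normal(size=d_model)

injected = clean.copy()
for layer in range(inject_at, n_layers):
    # Hand-built attenuation: the perturbation halves at every later layer.
    injected[layer] += 0.5 ** (layer - inject_at) * direction

profile = perturbation_profile(clean, injected)
print(np.round(profile, 3))
```

With real activations, a profile that peaks at the injection layer and shrinks toward the final layer would be consistent with the attenuation described here; a flat or growing profile would not.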
Claims (1)
claim
- Mechanistic characterization based on logit lens analysis showing gradual accuracy rise across layers
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- We hypothesize earlier-layer interventions allow more downstream computation to process and potentially correct the perturbation · hypothesis · 0.744 · Post-hoc explanation for why steering at layer 33 rather than layer 50 produced better ESR behavior in Llama-3.3-70B
- The middle layer residual stream features are causally implicated in multi-step reasoning. · claim · 0.737 · Features for Kobe Bryant, California, Lakers participate in computing the capital answer.
- Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
- Diagnosis of second failure mode explaining low harness-benefit for weak-tier models
- Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
- Methodological critique of prior work that fixed a single layer for truth probing.
- Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
- Performance is best when skipping both the first and last six layers when applying intervention · claim · 0.703 · Empirical configuration finding from ablation study on layer selection