finding

active

finding:logit-lens-prediction-accuracy-is-near-chance-at-layer-4-28-after-injection-at-l2-6

Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6

Shows that integration of the injected signal into an explicit prediction has barely begun two layers after the injection point

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (2)

claim

Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits
supports
Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
Signal integration from early perturbation into an explicit prediction requires substantial downstream computation spanning layers 4-20
supports
Mechanistic characterization based on logit lens analysis showing gradual accuracy rise across layers
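
The logit lens measurement behind these claims can be sketched as follows. This is a minimal illustration and not the paper's implementation: the toy residual network, the function names (`inject`, `run_with_injection`, `logit_lens_accuracy`), the random weights, and the concept directions are all assumptions made for the example. A logit lens on a real model would typically also apply the model's final normalization before unembedding, which is omitted here. Because the toy network is random, the printed accuracies carry no meaning; only the procedure is illustrated.

```python
import numpy as np


def inject(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction); the last axis is d_model."""
    unit = direction / np.linalg.norm(direction, axis=-1, keepdims=True)
    return hidden + alpha * unit


def run_with_injection(h0, layer_weights, direction, inject_layer, alpha):
    """Toy residual forward pass that records the stream after every layer.

    Each layer computes h <- h + tanh(h @ W), a stand-in for a real
    transformer block (assumption). The steering vector is added to the
    output of `inject_layer`, so all later layers see the perturbed stream.
    """
    h, states = h0, []
    for layer, w in enumerate(layer_weights):
        h = h + np.tanh(h @ w)
        if layer == inject_layer:
            h = inject(h, direction, alpha)
        states.append(h)
    return np.stack(states)  # (n_layers, n_examples, d_model)


def logit_lens_accuracy(hidden_by_layer, unembed, targets):
    """Decode each layer's state with the output unembedding and return
    the per-layer fraction of examples whose top-1 token is the target.

    hidden_by_layer: (n_layers, n_examples, d_model)
    unembed:         (d_model, vocab_size)
    targets:         (n_examples,) integer token ids
    """
    logits = hidden_by_layer @ unembed            # (n_layers, n_examples, vocab)
    predictions = logits.argmax(axis=-1)          # (n_layers, n_examples)
    return (predictions == targets).mean(axis=1)  # (n_layers,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_layers, n_examples, d_model, vocab_size = 8, 100, 32, 5

    # Hypothetical random model and concept directions, for illustration only.
    layer_weights = rng.normal(scale=0.1, size=(n_layers, d_model, d_model))
    unembed = rng.normal(size=(d_model, vocab_size))
    concept_dirs = rng.normal(size=(vocab_size, d_model))
    targets = rng.integers(0, vocab_size, size=n_examples)
    h0 = rng.normal(size=(n_examples, d_model))

    # Inject each example's concept direction at layer 2 with alpha = 6.
    states = run_with_injection(
        h0, layer_weights, concept_dirs[targets], inject_layer=2, alpha=6.0
    )
    for layer, acc in enumerate(logit_lens_accuracy(states, unembed, targets)):
        print(f"layer {layer}: top-1 accuracy {acc:.2f}")
```

With real activations, `states` would be replaced by the hidden states captured at each layer during a forward pass with the injection hook active, and `unembed` by the model's output projection.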

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.799
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.761
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.760
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance) · finding · 0.751
Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification · finding · 0.742
Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.738
Key quantitative evidence that detection signal is identical to global logit shift confound
At layer 0 α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits · finding · 0.732
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
Logit self-report drift is positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale · finding · 0.731
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative