finding
active
finding:logit-lens-prediction-accuracy-is-near-chance-at-layer-4-28-after-injection-at-l2-6
Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6
Shows that signal integration into explicit prediction has barely begun immediately after injection
Source paper
extracted_from (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood — ranked by edge-count
Claims (2)
claim
- Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
- Mechanistic characterization based on logit lens analysis showing gradual accuracy rise across layers
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.799 · Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
- Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.761 · Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
- Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification · finding · 0.742 · Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
- Key quantitative evidence that detection signal is identical to global logit shift confound
- Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
- Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative