finding
active
finding:strength-comparison-accuracy-reaches-73-at-layer-3-for-injection-pair-2-6-vs-50-chance
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
Source paper
extracted_from (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary positive claim of the paper, grounded in strength comparison and localization results
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Strength comparison accuracy averages 47% at layers 15-30, indistinguishable from 50% chance · finding · 0.894 · Shows collapse of introspective capability at later layers in the strength comparison task
- Shows that introspective accuracy scales with injection strength difference, not binary detection
- Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
- Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.767 · In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
- Core E3 finding validating S as a predictor of anchoring effectiveness
- Core result of Experiment 3: cross-model semantic convergence under self-referential processing
- Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6 · finding · 0.761 · Shows that signal integration into explicit prediction has barely begun immediately after injection
- Shows S predicts anchoring effectiveness.