finding
active
finding:strength-comparison-accuracy-averages-47-at-layers-15-30-indistinguishable-from-50-chance
Strength comparison accuracy averages 47% at layers 15-30, indistinguishable from 50% chance
Shows collapse of introspective capability at later layers in the strength comparison task
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key quantitative characterization of the layer-dependence of partial introspection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.894 · Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
- Shows S predicts anchoring effectiveness.
- Key quantitative evidence that detection signal is identical to global logit shift confound
- Shows that introspective accuracy scales with injection strength difference, not binary detection
- Core E3 finding validating S as a predictor of anchoring effectiveness
- Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.755 · The misleadingly high result that prior paradigm would report as evidence of introspection
- Core result of Experiment 3: cross-model semantic convergence under self-referential processing
- Per-category analysis showing reflection rate does not help within difficulty class