finding
active
finding:correlation-r-0-999-between-detection-adjusted-logit-difference-and-control-logit-increase-across-all-40-layer-strength-configurations
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations
Key quantitative evidence that detection signal is identical to global logit shift confound
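As a rough illustration of the statistic behind this finding, the sketch below computes a Pearson correlation between two per-configuration series. Everything in it is hypothetical: the 8 × 5 layer-strength grid (chosen only because it yields 40 configurations), the names `control_shift` and `detection_shift`, and the synthetic values. None of these come from the source paper. It only shows why a detection signal that is essentially the global logit shift plus small noise produces r very close to 1.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# ASSUMPTION: an 8-layer x 5-strength grid, picked only to give 40 configurations.
layers = [0, 5, 10, 15, 20, 25, 30, 35]
strengths = [1, 2, 3, 4, 5]

# SYNTHETIC placeholder data, not the paper's measurements.
rng = random.Random(0)
control_shift = []    # logit increase on the control question
detection_shift = []  # detection-adjusted logit difference
for layer in layers:
    for alpha in strengths:
        shift = alpha * math.exp(-layer / 10)  # made-up global logit shift
        control_shift.append(shift)
        # Detection signal modeled as the same shift plus small noise.
        detection_shift.append(shift + rng.gauss(0, 0.01))

r = pearson_r(control_shift, detection_shift)
print(f"configurations: {len(control_shift)}, Pearson r = {r:.4f}")
```

With real measurements, the two lists would hold one value per layer-strength configuration. A value of r near 1 then means the detection signal carries almost no variation that the control shift does not already explain.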
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
- Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.789 · The misleadingly high result that prior paradigm would report as evidence of introspection
- Strength comparison accuracy averages 47% at layers 15-30, indistinguishable from 50% chance · finding · 0.771 · Shows collapse of introspective capability at later layers in the strength comparison task
- Core E3 finding validating S as a predictor of anchoring effectiveness
- Demonstrates that activation similarity can diverge from logit weight similarity due to interference
- Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.760 · Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude