finding
active
Net detection signal (detection minus control) is near-zero across all 40 layer-strength configurations: mean = -0.01 ± 0.03 logits
Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
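The subtraction behind this finding can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are hypothetical, the toy numbers are invented for the example and are not results from the source, and it assumes the reported "±" is a sample standard deviation across configurations (the card does not say whether it is a standard deviation or a standard error).

```python
from statistics import mean, stdev


def net_detection_signal(detection_logits, control_logits):
    """Per-configuration net signal: detection logit minus control logit.

    Each list holds one value per layer-strength configuration.
    """
    if len(detection_logits) != len(control_logits):
        raise ValueError("need exactly one control value per detection value")
    return [d - c for d, c in zip(detection_logits, control_logits)]


def summarize(net):
    """Mean and sample standard deviation across configurations.

    Assumption: the card's "±" is a standard deviation, not a standard error.
    """
    return mean(net), stdev(net)


# Toy values for illustration only; NOT data from the source paper.
# Both conditions carry the same large shift, so the difference is small.
toy_detection = [4.0, 6.5, 3.0, 5.5]
toy_control = [4.0, 6.0, 3.5, 5.5]

toy_net = net_detection_signal(toy_detection, toy_control)
toy_mean, toy_std = summarize(toy_net)
print(f"net signal: mean = {toy_mean:.2f} ± {toy_std:.2f} logits")
```

If detection and control logits move together under injection, as in the toy lists, the raw detection values look large while the net signal stays near zero, which is the pattern this finding reports.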
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
- Key quantitative evidence that detection signal is identical to global logit shift confound
- Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.786 · The misleadingly high result that the prior paradigm would report as evidence of introspection
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.749 · Lindsey (2026) differential layer performance explained by Janus's path combinatorics: different tasks use different path distributions.
- Median layer where S(ℓ) peaks, across seeds.
- Task-specific peak anchoring score for structured reasoning domains.
- Synthetic theoretical example showing pernicious divergence via hidden pathway activation