finding
active
finding:binary-detection-adjusted-accuracy-reaches-97-3-at-layer-0-with-5-before-baseline-control-is-applied
Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied
The misleadingly high result that the prior paradigm would report as evidence of introspection
Source paper
extracted_from (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
- Accuracy at k=16 shots for B10.
- Key quantitative evidence that detection signal is identical to global logit shift confound
- Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
- Accuracy at k=16 shots for B8.
- Accuracy at k=16 shots for B9.
- Out-of-domain generalization showing deception features track general representational honesty
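To illustrate why a high binary detection score on its own need not indicate introspection, here is a minimal sketch of the confound named in this finding. It is not the source paper's setup or its definition of adjusted accuracy. It assumes a toy model in which the "yes" versus "no" logit margin is normally distributed and steering adds the same shift to every yes/no question. The names `yes_rate`, `BASE_LOGIT`, and `GLOBAL_SHIFT`, and their values, are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def yes_rate(base_logit: float, shift: float, noise_sd: float = 1.0) -> float:
    """P(answer is 'yes') when the yes-minus-no logit margin is
    Normal(base_logit + shift, noise_sd) and 'yes' wins when the margin > 0."""
    return normal_cdf((base_logit + shift) / noise_sd)


# Hypothetical toy values, chosen for illustration only.
BASE_LOGIT = -3.0    # unsteered model strongly prefers "no"
GLOBAL_SHIFT = 6.0   # steering pushes every yes/no answer toward "yes"

# Detection question ("was a concept injected?") on a balanced trial set.
hit_rate = yes_rate(BASE_LOGIT, GLOBAL_SHIFT)      # steered trials
false_alarm_rate = yes_rate(BASE_LOGIT, 0.0)       # unsteered trials
detection_accuracy = 0.5 * (hit_rate + (1.0 - false_alarm_rate))

# Control question: unrelated yes/no question whose correct answer is "no",
# asked while the same steering is applied.
control_yes_rate = yes_rate(BASE_LOGIT, GLOBAL_SHIFT)

# Signal left after subtracting the control: zero under a uniform shift.
controlled_signal = hit_rate - control_yes_rate

print(f"detection accuracy (no control): {detection_accuracy:.3f}")
print(f"'yes' rate on control question:  {control_yes_rate:.3f}")
print(f"detection minus control:         {controlled_signal:.3f}")
```

Under these assumptions the detection accuracy is high, yet the steered model says "yes" just as often on the unrelated control question, so the difference between detection and control vanishes. A real evaluation would estimate both rates from model outputs rather than from a closed-form distribution.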