finding
active
Binary detection accuracy (up to 97.3% at L0, α=5) is entirely explained by global logit shifts (r=0.999 correlation with control)
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
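To illustrate the kind of check behind the r=0.999 figure, here is a minimal sketch that computes a Pearson correlation between a detection score and a control score measured under the same conditions. The function name `pearson_r`, the variable names, and all numeric values are hypothetical placeholders, not data or code from the source paper; the paper's exact metric and conditions are not given on this page.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL placeholder scores, one per (layer, alpha) condition.
# These are NOT results from the paper; they only mimic two curves
# that move together.
detection_scores = [0.10, 0.35, 0.60, 0.85, 0.95]
control_scores = [0.09, 0.36, 0.58, 0.86, 0.94]

r = pearson_r(detection_scores, control_scores)
print(f"r = {r:.3f}")
```

If the detection score tracks the control score this closely across conditions, the detection score carries essentially no information beyond what the control already captures, which is the sense in which the finding attributes the apparent accuracy to a global logit shift.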
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.882 · The misleadingly high result that the prior paradigm would report as evidence of introspection
- Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
- Key quantitative evidence that detection signal is identical to global logit shift confound
- Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.786 · Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
- Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
- Table 2, row 3, showing equivalence when prior preferences match rewards.
- Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6 · finding · 0.760 · Shows that signal integration into explicit prediction has barely begun immediately after injection
- Quantifies the information gain from using logit-based expected value over greedy or sampled decoding