finding
active
At layer 0 with α=5, the detection-adjusted logit difference is +3.19 and the control increase is +3.22, a difference of only 0.03 logits
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
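The arithmetic behind this finding can be checked directly. The sketch below uses only the two values quoted above. The helper name `residual_signal` and the variable names are hypothetical, introduced here for illustration and not taken from the source paper.

```python
# Values quoted in the finding (layer 0, alpha = 5).
detection_shift = 3.19  # detection-adjusted logit difference
control_shift = 3.22    # logit increase in the control condition


def residual_signal(detection: float, control: float) -> float:
    """Hypothetical helper: detection shift remaining after subtracting the control shift."""
    return detection - control


gap = residual_signal(detection_shift, control_shift)
print(f"residual = {gap:+.2f} logits")  # prints "residual = -0.03 logits"
```

The residual has a magnitude of 0.03 logits, so almost all of the apparent detection shift is matched by the control condition.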
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key quantitative evidence that detection signal is identical to global logit shift confound
- Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.833
  The misleadingly high result that the prior paradigm would report as evidence of introspection
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
- Model baseline logit difference ΔL_baseline = -3.96, indicating prior preference for 'NO' responses · finding · 0.774
  Establishes the model's prior YES/NO bias, needed to interpret detection accuracies
- Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
- Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
- Quantifies the information gain from using logit-based expected value over greedy or sampled decoding