finding

active

finding:binary-detection-adjusted-accuracy-reaches-97-3-at-layer-0-with-5-before-baseline-control-is-applied

Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied

The misleadingly high result that the prior paradigm would report as evidence of introspection

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Prior experimental paradigms may overestimate introspective capabilities by conflating genuine self-awareness with uniform output distribution shifts
supports
Critical methodological claim directed at Lindsey 2026 and similar work using binary detection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.882
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
At layer 0 α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits · finding · 0.833
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
B10 final accuracy 94.8 ± 1.2% · finding · 0.791
Accuracy at k=16 shots for B10.
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.789
Key quantitative evidence that detection signal is identical to global logit shift confound
Net detection signal (detection minus control) is near-zero across all 40 layer-strength configurations: mean = -0.01 ± 0.03 logits · finding · 0.786
Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
B8 final accuracy 92.4 ± 1.8% · finding · 0.770
Accuracy at k=16 shots for B8.
B9 final accuracy 89.7 ± 2.1% · finding · 0.765
Accuracy at k=16 shots for B9.
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.760
Out-of-domain generalization showing deception features track general representational honesty
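
The layer 0, α=5 figures in the neighboring findings can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the net detection signal is the detection-adjusted logit difference minus the control logit increase, as the "detection minus control" entry suggests. The function name `net_detection_signal` is hypothetical and not taken from the paper.

```python
def net_detection_signal(detection_logit_diff: float,
                         control_logit_increase: float) -> float:
    """Detection minus control, in logits (hypothetical helper, assumed definition)."""
    return detection_logit_diff - control_logit_increase


# Values reported for layer 0, alpha = 5 in the related finding.
detection = 3.19  # detection-adjusted logit difference
control = 3.22    # logit increase on the control condition

net = net_detection_signal(detection, control)
print(f"net signal: {net:+.2f} logits")  # prints "net signal: -0.03 logits"
```

Under this assumed definition the residual is about three hundredths of a logit, which is why the 97.3% adjusted accuracy stops being evidence of introspection once the baseline control is applied.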