finding

active

finding:net-detection-signal-detection-minus-control-is-near-zero-across-all-40-layer-strength-configurations-mean-0-01-0-03-logits

Net detection signal (detection minus control) is near-zero across all 40 layer-strength configurations: mean = -0.01 ± 0.03 logits

Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
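
The net signal described above is a plain subtraction per configuration, followed by a summary across configurations. A minimal sketch in Python follows. The function names are hypothetical, and reading the "±" as a sample standard deviation is an assumption, since the record does not say whether it is a standard deviation or a standard error. The only numbers used are the layer 0, α=5 values reported in a related finding.

```python
from statistics import mean, stdev


def net_detection_signal(detection_shift, control_shift):
    """Detection logit shift minus control logit shift for one configuration."""
    return detection_shift - control_shift


def summarize(net_signals):
    """Mean and sample standard deviation across configurations.

    Assumption: the reported "±" is a standard deviation; the record
    does not specify this.
    """
    return mean(net_signals), stdev(net_signals)


# Layer 0, alpha = 5: detection +3.19, control +3.22 (from a related finding).
net = net_detection_signal(3.19, 3.22)
print(round(net, 2))  # -0.03
```

Passing the 40 per-configuration net signals to `summarize` would give the kind of mean and spread quoted in the title, under the stated assumption about "±".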

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Apparent success on binary detection tasks is entirely explained by mechanical logit shifts that bias models toward affirmative responses regardless of question content
supports
Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

At layer 0, α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits · finding · 0.819
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.799
Key quantitative evidence that detection signal is identical to global logit shift confound
Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.786
The misleadingly high result that prior paradigm would report as evidence of introspection
Binary detection accuracy (up to 97.3% at layer 0, α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.768
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth · finding · 0.749
Lindsey (2026) differential layer performance explained by Janus's path combinatorics — different tasks use different path distributions.
Peak layer ℓ* median 10, IQR 0.384 · finding · 0.744
Median layer where S(ℓ) peaks, across seeds.
Math/code tasks S ≈ -1.65 at layers 8–12 · finding · 0.732
Task-specific peak anchoring score for structured reasoning domains.
Mean-difference patching in a two-layer ReLU circuit flips the decision to class-A by activating a third hidden unit that is silent for all natural class-A inputs · finding · 0.731
Synthetic theoretical example showing pernicious divergence via hidden pathway activation