finding

active

finding:correlation-r-0-999-between-detection-adjusted-logit-difference-and-control-logit-increase-across-all-40-layer-strength-configurations

Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations

Key quantitative evidence that the detection signal is indistinguishable from the global logit-shift confound
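The comparison behind this finding can be sketched as follows: compute the Pearson correlation between the two per-configuration series, then the net signal (detection minus control). This is a minimal illustration and not the paper's code. The helper names `pearson_r` and `net_signal` are hypothetical, and the 40 values are synthetic placeholders generated with an assumed noise level, not the paper's measurements.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def net_signal(detection, control):
    """Per-configuration detection-adjusted logit difference minus control increase."""
    return [d - c for d, c in zip(detection, control)]


# SYNTHETIC placeholder data (assumption): 40 layer-strength configurations
# in which detection tracks control up to small noise.
rng = random.Random(0)
control = [rng.uniform(0.0, 3.5) for _ in range(40)]
detection = [c + rng.gauss(0.0, 0.03) for c in control]

r = pearson_r(detection, control)
net = net_signal(detection, control)
mean_net = sum(net) / len(net)

print(f"r = {r:.4f}, mean net signal = {mean_net:+.3f} logits")
```

When detection and control move together this closely, r approaches 1 and the net signal stays near zero, which is the pattern the finding describes.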

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Apparent success on binary detection tasks is entirely explained by mechanical logit shifts that bias models toward affirmative responses regardless of question content
supports
Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

At layer 0 α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits · finding · 0.886
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.820
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
Net detection signal (detection minus control) is near-zero across all 40 layer-strength configurations: mean = -0.01 ± 0.03 logits · finding · 0.799
Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.789
The misleadingly high result that prior paradigm would report as evidence of introspection
Strength comparison accuracy averages 47% at layers 15-30, indistinguishable from 50% chance · finding · 0.771
Shows collapse of introspective capability at later layers in the strength comparison task
Correlation between layer-wise scores and task accuracy ρ = −0.73 (p < 0.001) on LLaMA · finding · 0.762
Core E3 finding validating S as a predictor of anchoring effectiveness
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.762
Demonstrates that activation similarity can diverge from logit weight similarity due to interference
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.760
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude