finding

active

finding:at-layer-0-5-detection-adjusted-logit-difference-is-3-19-and-control-increase-is-3-22-a-difference-of-only-0-03-logits

At layer 0 with α=5, the detection-adjusted logit difference is +3.19 and the control increase is +3.22, a difference of only 0.03 logits

Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
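
The arithmetic behind this finding can be checked with a minimal sketch. The two input values are the ones quoted above; the helper name `net_detection_signal` and the variable names are illustrative assumptions, not identifiers from the paper. The "detection minus control" definition follows the net-signal wording used in the related findings.

```python
# Values quoted in the finding (layer 0, alpha = 5), in logits.
detection_shift = 3.19  # detection-adjusted logit difference
control_shift = 3.22    # control logit increase


def net_detection_signal(detection: float, control: float) -> float:
    """Hypothetical helper: net signal defined as detection minus control."""
    return detection - control


net = net_detection_signal(detection_shift, control_shift)

# The sign is negative: the control shift is marginally larger than the
# detection shift, so nothing is left over to attribute to detection.
print(f"net detection signal: {net:+.2f} logits")
print(f"absolute gap: {abs(net):.2f} logits")
```

Under this definition the gap is 0.03 logits in magnitude, with the control shift slightly exceeding the detection shift.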

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Apparent success on binary detection tasks is entirely explained by mechanical logit shifts that bias models toward affirmative responses regardless of question content
supports
Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.886
Key quantitative evidence that detection signal is identical to global logit shift confound
Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.833
The misleadingly high result that prior paradigm would report as evidence of introspection
Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.823
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
Net detection signal (detection minus control) is near-zero across all 40 layer-strength configurations: mean = -0.01 ± 0.03 logits · finding · 0.819
Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
Model baseline logit difference ΔL_baseline = -3.96, indicating prior preference for 'NO' responses · finding · 0.774
Establishes the model's prior YES/NO bias, needed to interpret detection accuracies
Logit self-report drift is positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale · finding · 0.765
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
Under ask-correct, arithmetic tasks A1-A2 show gradual AUROC increase peaking only in final layers, unlike the sharp transition under no-prompt. · finding · 0.756
Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B · finding · 0.751
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding