finding

active

finding:binary-detection-accuracy-up-to-97-3-at-l0-5-is-entirely-explained-by-global-logit-shifts-r-0-999-correlation-with-control

Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control)

Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
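The baseline-control logic behind this finding can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names are hypothetical, and only the first pair of values (+3.19 detection vs. +3.22 control at L0 α=5) comes from the entries on this page. The remaining numbers are invented placeholders standing in for other layer-strength configurations.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def net_detection_signal(detection, control):
    """Detection logit difference minus the control logit increase, per configuration."""
    return [d - c for d, c in zip(detection, control)]

# First pair is the L0, alpha=5 example quoted on this page.
# The other three pairs are ILLUSTRATIVE PLACEHOLDERS, not results from the paper.
detection = [3.19, 1.10, 0.40, 2.05]
control = [3.22, 1.08, 0.42, 2.07]

net = net_detection_signal(detection, control)
r = pearson_r(detection, control)

print("net signal per configuration:", [round(v, 2) for v in net])
print("correlation with control:", round(r, 4))
```

If the detection signal tracks the control almost perfectly and the net signal stays near zero, the apparent accuracy carries no information beyond the global shift, which is the pattern this finding reports.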

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Apparent success on binary detection tasks is entirely explained by mechanical logit shifts that bias models toward affirmative responses regardless of question content
supports
Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
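To see why a uniform logit shift produces affirmative answers regardless of question content, consider the following sketch. It assumes the yes/no answer is read from a two-way softmax over the "yes" and "no" logits. The baseline values, the shift size, and all names are hypothetical and chosen only for illustration.

```python
import math

def p_yes(logit_diff: float) -> float:
    """Probability of "yes" in a two-way softmax, given logit(yes) - logit(no)."""
    return 1.0 / (1.0 + math.exp(-logit_diff))

# Hypothetical baseline logit differences for two questions with different content.
baselines = {"detection_question": -2.0, "control_question": -2.0}

# Hypothetical uniform shift toward "yes" caused by the intervention itself.
shift = 3.0

unshifted = {name: p_yes(diff) for name, diff in baselines.items()}
shifted = {name: p_yes(diff + shift) for name, diff in baselines.items()}

for name in baselines:
    print(f"{name}: P(yes) {unshifted[name]:.3f} -> {shifted[name]:.3f}")
```

Both questions flip from "no" to "yes" by the same amount, so a "yes" on the detection question alone cannot be credited to introspection.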

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.882
The misleadingly high result that prior paradigm would report as evidence of introspection
At layer 0 α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits · finding · 0.823
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.820
Key quantitative evidence that detection signal is identical to global logit shift confound
Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.786
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
Net detection signal (detection minus control) is near-zero across all 40 layer-strength configurations: mean = -0.01 ± 0.03 logits · finding · 0.768
Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents · finding · 0.764
Table 2, row 3, showing equivalence when prior preferences match rewards.
Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6 · finding · 0.760
Shows that signal integration into explicit prediction has barely begun immediately after injection
Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B · finding · 0.756
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding