finding

active

finding:model-baseline-logit-difference-l-baseline-3-96-indicating-prior-preference-for-no-responses

Model baseline logit difference ΔL_baseline = -3.96, indicating prior preference for 'NO' responses

Establishes the model's prior YES/NO bias, needed to interpret detection accuracies
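To make the size of this bias concrete, the sketch below converts the reported baseline into an implied YES probability. It assumes that ΔL is defined as logit(YES) minus logit(NO), which matches the sign convention here (a negative value favors 'NO'), and that only these two answer tokens compete. The source confirms neither assumption, and the function name `yes_probability` is hypothetical.

```python
import math


def yes_probability(delta_logit: float) -> float:
    """Two-way softmax P(YES), assuming delta_logit = logit(YES) - logit(NO)."""
    return 1.0 / (1.0 + math.exp(-delta_logit))


# Baseline value reported in this finding.
DELTA_L_BASELINE = -3.96

p_yes = yes_probability(DELTA_L_BASELINE)
print(f"P(YES) = {p_yes:.4f}, P(NO) = {1 - p_yes:.4f}")
```

Under these assumptions the unperturbed model would answer YES roughly 2% of the time, so any intervention that shifts the logit difference upward by about 4 logits would flip the default answer regardless of what the model actually detects.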

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

At layer 0 α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits · finding · 0.774
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.760
Key quantitative evidence that detection signal is identical to global logit shift confound
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt · finding · 0.750
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
Active Inference null model (no prior preferences) achieved average score 50.03 [49.70, 50.35] in deterministic FrozenLake · finding · 0.748
Table 1.
Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale · finding · 0.743
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B · finding · 0.742
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.741
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.737
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects