finding
active
Model baseline logit difference ΔL_baseline = -3.96, indicating prior preference for 'NO' responses
Establishes the model's prior YES/NO bias, needed to interpret detection accuracies
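To make the size of this prior concrete, the sketch below converts the reported logit difference into a probability. It assumes ΔL is defined as logit(YES) minus logit(NO), which matches the negative sign indicating a 'NO' preference, and that only the YES and NO tokens compete, so the two-way softmax reduces to a sigmoid. The function name `p_yes_from_logit_diff` is hypothetical and not taken from the source paper.

```python
import math


def p_yes_from_logit_diff(delta_l: float) -> float:
    """Return P(YES) for a two-token YES/NO softmax.

    Assumes delta_l = logit(YES) - logit(NO); this sign convention is an
    assumption, not something stated explicitly in the finding.
    """
    return 1.0 / (1.0 + math.exp(-delta_l))


DELTA_L_BASELINE = -3.96  # value reported in the finding

p_yes = p_yes_from_logit_diff(DELTA_L_BASELINE)
p_no = 1.0 - p_yes
print(f"P(YES) = {p_yes:.4f}, P(NO) = {p_no:.4f}")
# P(YES) = 0.0187, P(NO) = 0.9813
```

Under these assumptions the model answers 'NO' roughly 98% of the time before any intervention, and any uniform upward shift in the YES logit larger than 3.96 would flip the default answer. That is why detection accuracies have to be read against this baseline.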
Source paper
extracted_from (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
- Key quantitative evidence that detection signal is identical to global logit shift confound
- Demonstrates persistence of compliance gap even when training non-compliance reaches zero
- Table 1.
- Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
- Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance (finding, similarity 0.737): Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects