finding
active

Model baseline logit difference ΔL_baseline = -3.96, indicating prior preference for 'NO' responses

Establishes the model's prior YES/NO bias, needed to interpret detection accuracies
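
To make the size of this bias concrete, the sketch below converts a logit difference into an implied probability of answering 'YES'. It rests on two assumptions that the finding does not state explicitly: that ΔL is defined as logit(YES) minus logit(NO), a sign convention inferred from a negative value meaning a 'NO' preference, and that probability mass is renormalized over only those two tokens. The function name is hypothetical and is not taken from the source paper.

```python
import math


def p_yes_from_logit_diff(delta_l: float) -> float:
    """Implied P(YES) under a two-way softmax over the YES and NO tokens.

    Assumes delta_l = logit(YES) - logit(NO); with only two options the
    softmax reduces to the logistic sigmoid of the difference.
    """
    return 1.0 / (1.0 + math.exp(-delta_l))


# Baseline value reported in the finding.
delta_l_baseline = -3.96
p_yes = p_yes_from_logit_diff(delta_l_baseline)
odds_no_to_yes = (1.0 - p_yes) / p_yes  # equals exp(-delta_l_baseline)

print(f"P(YES) = {p_yes:.4f}, NO:YES odds = {odds_no_to_yes:.1f}:1")
```

Under these assumptions, ΔL_baseline = -3.96 corresponds to a prior 'YES' probability of roughly 1.9%, or odds of about 52 to 1 in favor of 'NO', which is the reference point against which detection accuracies would be read.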

Source paper

extracted_from
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
