finding
active
finding:at-layer-0-5-detection-adjusted-logit-difference-is-3-19-and-control-increase-is-3-22-a-difference-of-only-0-03-logits

At layer 0 α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits

Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy

Source paper

extracted_from
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
(2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.