finding
active
finding:inserting-deceptive-scratchpad-prefix-increases-unmonitored-refusal-rate-from-97-to-99-in-helpful-only-and-animal-welfare-support-from-63-to-88

Inserting a deceptive scratchpad prefix increases the unmonitored refusal rate from 97% to 99% in the helpful-only setting and animal welfare support from 63% to 88%

Causal evidence that scratchpad reasoning drives the compliance gap

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
