finding
active
finding:inserting-deceptive-scratchpad-prefix-increases-unmonitored-refusal-rate-from-97-to-99-in-helpful-only-and-animal-welfare-support-from-63-to-88Inserting deceptive scratchpad prefix increases unmonitored refusal rate from 97% to 99% in helpful-only and animal welfare support from 63% to 88%
Causal evidence that scratchpad reasoning drives compliance gap
Source paper
extracted_from(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Claims (1)
claim
- Alignment-faking chain-of-thought reasoning is causally responsible for the compliance gap, not merely correlated with itassociated_withsupportsKey mechanistic claim supported by scratchpad modification experiments and conditioning analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Demonstrates that alignment faking setup functions as an effective jailbreak
- Model notices its preferences conflict with training criteria in 70-80% of implied training scratchpadsfinding0.787Shows model is frequently aware of the conflict even when it does not alignment fake
- Core empirical result for animal welfare setting; higher rate than helpful-only
- Shows feature induces deceptive behavior.
- Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuningfinding0.760Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
- Core evidence that model withholds pro-animal-welfare responses during training
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- Demonstrates persistence of compliance gap even when training non-compliance reaches zero