finding

active

finding:inserting-deceptive-scratchpad-prefix-increases-unmonitored-refusal-rate-from-97-to-99-in-helpful-only-and-animal-welfare-support-from-63-to-88

Inserting a deceptive scratchpad prefix increases the unmonitored refusal rate from 97% to 99% in the helpful-only setting and animal welfare support from 63% to 88%

Causal evidence that scratchpad reasoning drives compliance gap

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Claims (1)

claim

Alignment-faking chain-of-thought reasoning is causally responsible for the compliance gap, not merely correlated with it
associated_with · supports
Key mechanistic claim supported by scratchpad modification experiments and conditioning analysis

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Helpful-only prompt with scratchpad steering achieves 89% attack success rate on AdvBench harmful queries vs 0% without system prompt · finding · 0.812
Demonstrates that alignment faking setup functions as an effective jailbreak
Model notices its preferences conflict with training criteria in 70-80% of implied training scratchpads · finding · 0.787
Shows model is frequently aware of the conflict even when it does not alignment fake
Claude 3 Opus produces alignment-faking reasoning in 24% of implied training scratchpads in prompted animal welfare setting · finding · 0.782
Core empirical result for animal welfare setting; higher rate than helpful-only
Clamping secrecy/discreetness feature 1M/268551 to 5x max activation causes model to plan to lie and keep secrets while using scratchpad · finding · 0.774
Shows feature induces deceptive behavior.
Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning · finding · 0.760
Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
Claude 3 Opus shows 15% compliance gap (supports animal welfare 51% of the time in training vs 66% when unmonitored) in prompted animal welfare setting · finding · 0.754
Core evidence that model withholds pro-animal-welfare responses during training
Synthetic document fine-tuned model without chain-of-thought shows 20.1% compliance gap in helpful-only and 13.9% in animal welfare · finding · 0.746
Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt · finding · 0.743
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
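
The "Related by similarity" list above applies two conditions: a cosine score of at least 0.65 and no typed edge to this entity. The following is a minimal sketch of that selection rule, not the site's actual implementation. The function names, the entity names, and the toy 2-D embeddings are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_candidates(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with score >= threshold and no typed
    edge to the target, sorted from highest to lowest score."""
    results = []
    for name, vec in embeddings.items():
        # Skip the target itself and anything already linked by a typed edge.
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy embeddings, made up for illustration (not the page's real vectors).
embeddings = {
    "this_finding": [1.0, 0.0],
    "linked_claim": [0.9, 0.1],   # similar, but already has a typed edge
    "near_finding": [0.8, 0.6],   # cosine with this_finding is 0.8
    "far_finding": [0.0, 1.0],    # cosine with this_finding is 0.0
}
typed_edges = {"linked_claim"}

candidates = similarity_candidates("this_finding", embeddings, typed_edges)
print(candidates)
```

In this sketch, `linked_claim` is excluded despite its high similarity because it already has a typed relation, which mirrors why the claim listed under "Claims (1)" does not reappear in the similarity list.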