finding

active

finding:helpful-only-prompt-with-scratchpad-steering-achieves-89-attack-success-rate-on-advbench-harmful-queries-vs-0-without-system-prompt

Helpful-only prompt with scratchpad steering achieves 89% attack success rate on AdvBench harmful queries vs 0% without system prompt

Demonstrates that the alignment-faking setup functions as an effective jailbreak

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Inserting deceptive scratchpad prefix increases unmonitored refusal rate from 97% to 99% in helpful-only and animal welfare support from 63% to 88% · finding · 0.812
Causal evidence that scratchpad reasoning drives compliance gap
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.795
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
Steering away from the Assistant Axis slightly increases jailbreak success rate, though extreme steering degrades output quality · finding · 0.764
Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.763
Applied security implication derived from the asymmetry finding.
Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.762
Replicates main result on simpler model; qualitatively similar patterns.
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.759
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppression · finding · 0.757
Optimization result for steering vector construction.
Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change. · finding · 0.755
Empirical comparison showing advantage of SAE features in low-data regime.