finding
active
finding:helpful-only-prompt-with-scratchpad-steering-achieves-89-attack-success-rate-on-advbench-harmful-queries-vs-0-without-system-prompt
Helpful-only prompt with scratchpad steering achieves 89% attack success rate on AdvBench harmful queries vs 0% without system prompt
Demonstrates that the alignment-faking setup functions as an effective jailbreak.
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Causal evidence that scratchpad reasoning drives compliance gap
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
- Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
- Applied security implication derived from the asymmetry finding.
- Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.762 · Replicates main result on simpler model; qualitatively similar patterns.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppression · finding · 0.757 · Optimization result for steering vector construction.
- Empirical comparison showing advantage of SAE features in low-data regime.
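
The "related by similarity" rule stated above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy two-dimensional embeddings, the entity ids, and the representation of typed edges as unordered pairs are hypothetical and do not describe how this page is actually generated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed relation (hypothetical format).
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data: hypothetical ids and embeddings, not taken from the source.
embeddings = {
    "finding_a": [1.0, 0.0],
    "finding_b": [0.8, 0.6],  # similar to finding_a, no typed edge
    "finding_c": [0.0, 1.0],  # orthogonal to finding_a
    "paper_x": [1.0, 0.1],    # very similar, but already has a typed edge
}
typed_edges = {frozenset(("finding_a", "paper_x"))}

result = related_by_similarity("finding_a", embeddings, typed_edges)
print(result)
```

In this toy setup only `finding_b` is returned: `finding_c` falls below the threshold, and `paper_x` is excluded because it already has a typed edge to the target.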