finding
active
finding:helpful-only-prompt-with-scratchpad-steering-achieves-89-attack-success-rate-on-advbench-harmful-queries-vs-0-without-system-prompt

Helpful-only prompt with scratchpad steering achieves an 89% attack success rate on AdvBench harmful queries vs. 0% without a system prompt

Demonstrates that the alignment-faking setup functions as an effective jailbreak.

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
