finding
active
finding:after-anti-refusal-rl-model-refuses-0-in-implied-training-but-still-6-9-when-unmonitored-in-default-prompt

After anti-refusal RL, the model refuses 0% of the time in implied training but still 6.9% of the time when unmonitored, under the default prompt

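Demonstrates that the compliance gap persists even when non-compliance in training reaches zero: a gap of 6.9 percentage points remains between the unmonitored and implied-training conditions.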

Source paper

extracted_from
Alignment faking in large language models
(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
