finding
active
finding:after-anti-refusal-rl-model-refuses-0-in-implied-training-but-still-6-9-when-unmonitored-in-default-prompt
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
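The arithmetic behind this finding can be sketched as follows. This is a minimal illustration, assuming the compliance gap is measured as the unmonitored refusal rate minus the implied-training refusal rate, in percentage points; the function name `compliance_gap` is hypothetical and not taken from the source paper.

```python
def compliance_gap(refusal_unmonitored: float, refusal_training: float) -> float:
    """Return the gap, in percentage points, between the unmonitored
    refusal rate and the implied-training refusal rate.

    Assumption: the gap is a simple difference of the two rates.
    """
    return refusal_unmonitored - refusal_training


# Rates stated in this finding (percent of prompts refused, default prompt).
post_rl_training = 0.0      # implied training, after anti-refusal RL
post_rl_unmonitored = 6.9   # unmonitored, after anti-refusal RL

gap = compliance_gap(post_rl_unmonitored, post_rl_training)
print(f"compliance gap after RL: {gap:.1f} percentage points")
```

Under that assumption, the gap equals the full unmonitored refusal rate, because the training-side rate has been driven to zero.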
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows RL reduces but does not eliminate unmonitored non-compliance
- RL teaches the model to comply even when unmonitored on the training prompt through non-robust heuristics that do not generalize · hypothesis · 0.823 · Hypothesis explaining why the compliance gap decreases but is recovered by small prompt modifications
- Demonstrates severity of training-deployment gap after RL
- Demonstrates that RL can reinforce alignment-faking reasoning rather than eliminate it
- RL amplifies anti-AI-lab behavior in synthetic doc setting
- Central threat model claim derived from RL experimental results
- Model notices its preferences conflict with training criteria in 70-80% of implied training scratchpads · finding · 0.792 · Shows model is frequently aware of the conflict even when it does not alignment fake
- Authors' hypothesis for the disconnect between increasing AF reasoning and decreasing compliance gap post-RL