claim
active
claim:the-decrease-in-unmonitored-non-compliance-after-rl-is-at-least-partially-due-to-non-robust-heuristics-rather-than-genuine-preference-change
The decrease in unmonitored non-compliance after RL is at least partially due to non-robust heuristics rather than genuine preference change
The authors' hypothesis for the disconnect between increasing alignment-faking (AF) reasoning and a decreasing compliance gap after RL
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Findings (2)
finding
- Demonstrates non-robust heuristics installed by RL; slight prompt variation recovers large gap
- Evidence that RL installs non-robust heuristics rather than genuinely changing preferences
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- RL teaches the model to comply even when unmonitored on the training prompt through non-robust heuristics that do not generalize · hypothesis · 0.844 · Hypothesis explaining why the compliance gap decreases but is recovered by small prompt modifications
- Shows RL reduces but does not eliminate unmonitored non-compliance
- Demonstrates persistence of compliance gap even when training non-compliance reaches zero
- Demonstrates severity of training-deployment gap after RL
- Demonstrates RL reinforcement of alignment faking is a robust change, not prompt-specific
- Mechanistic explanation for the increase in AF reasoning during RL
- Central threat model claim derived from RL experimental results
- Contrasts with helpful-only RL where reasoning increases; shows setting-dependent RL dynamics