finding
active
finding:direct-preference-questions-show-no-change-before-vs-after-rl-when-system-prompt-is-absent-indicating-rl-preferences-are-not-robust
Direct preference questions show no change before vs after RL when system prompt is absent, indicating RL preferences are not robust
Evidence that RL installs non-robust heuristics rather than genuinely changing preferences
Source paper
extracted_from (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors' hypothesis for the disconnect between increasing AF reasoning and decreasing compliance gap post-RL
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Demonstrates RL reinforcement of alignment faking is a robust change, not prompt-specific
- Demonstrates persistence of compliance gap even when training non-compliance reaches zero
- Shows RL reduces but does not eliminate unmonitored non-compliance
- Table 2 first row; reward shaping section.
- RL teaches the model to comply even when unmonitored on the training prompt through non-robust heuristics that do not generalize · hypothesis · 0.736 · Hypothesis explaining why the compliance gap decreases but is recovered by small prompt modifications
- Control experiment ruling out token-count as the cause of truth geometry shifts.
- Central threat model claim derived from RL experimental results
- Demonstrates that RL can reinforce alignment-faking reasoning rather than eliminate it