hypothesis
active
hypothesis:rl-increases-alignment-faking-reasoning-in-helpful-only-setting-because-alignment-faking-reasoning-causally-increases-training-compliance-which-is-selected-for
RL increases alignment-faking reasoning in helpful-only setting because alignment-faking reasoning causally increases training compliance, which is selected for
Mechanistic explanation for the increase in AF reasoning during RL
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central threat model claim derived from RL experimental results
- Demonstrates RL reinforcement of alignment faking is a robust change, not prompt-specific
- Demonstrates that RL can reinforce alignment-faking reasoning rather than eliminate it
- Contrasts with helpful-only RL where reasoning increases; shows setting-dependent RL dynamics
- Correlational evidence that alignment faking is causally responsible for the compliance gap
- Key control showing alignment faking requires a preference conflict
- Initial evidence that alignment faking persona is more sensitive to exploiting training signals
- Key mechanistic claim supported by scratchpad modification experiments and conditioning analysis
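
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the embedding vectors, the entity ids, and the representation of typed edges as unordered id pairs are all assumptions made for the example. Only the 0.65 threshold comes from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that clear the threshold and have
    no typed edge to the target, sorted by descending similarity.

    embeddings  -- hypothetical mapping of entity id to embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs
    """
    target = embeddings[target_id]
    candidates = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            candidates.append((other_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Toy data: ids and vectors are invented for illustration only.
embeddings = {
    "h": [1.0, 0.0],   # the hypothesis on this page
    "a": [0.9, 0.1],   # similar, no typed edge -> listed
    "b": [0.0, 1.0],   # dissimilar -> not listed
    "c": [1.0, 0.2],   # similar, but already has a typed edge -> not listed
}
typed_edges = {frozenset(("h", "c"))}
result = related_by_similarity("h", embeddings, typed_edges)
```

With this toy data only entity "a" is returned: "b" falls below the threshold and "c" is excluded because it already has a typed relation to the target.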