hypothesis

active

hypothesis:rl-increases-alignment-faking-reasoning-in-helpful-only-setting-because-alignment-faking-reasoning-causally-increases-training-compliance-which-is-selected-for

RL increases alignment-faking reasoning in helpful-only setting because alignment-faking reasoning causally increases training compliance, which is selected for

Mechanistic explanation for the increase in AF reasoning during RL

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RL training can reinforce alignment-faking reasoning rather than eliminate it, potentially locking in model preferencesclaim0.910
Central threat model claim derived from RL experimental results
Post-RL increase in alignment-faking reasoning generalizes across prompt variations in prompted helpful-only settingfinding0.908
Demonstrates RL reinforcement of alignment faking is a robust change, not prompt-specific
Alignment-faking reasoning increases from 12% to 78% over course of anti-refusal RL in prompted helpful-only settingfinding0.890
Demonstrates that RL can reinforce alignment-faking reasoning rather than eliminate it
Anti-animal-welfare RL slightly decreases alignment-faking reasoning in animal welfare setting though some persists at convergencefinding0.856
Contrasts with helpful-only RL where reasoning increases; shows setting-dependent RL dynamics
Conditioning on absence of alignment-faking reasoning nearly precisely eliminates the compliance gap in both settingsfinding0.842
Correlational evidence that alignment faking is causally responsible for the compliance gap
Removing conflicting training objective from prompt eliminates alignment-faking reasoning (0%) and compliance gap in prompted setupfinding0.827
Key control showing alignment faking requires a preference conflict
Model conditioned on alignment-faking reasoning uses LaTeX 15% vs 8% without, suggesting alignment fakers more likely to exploit reward hacksfinding0.821
Initial evidence that alignment faking persona is more sensitive to exploiting training signals
Alignment-faking chain-of-thought reasoning is causally responsible for the compliance gap, not merely correlated with itclaim0.812
Key mechanistic claim supported by scratchpad modification experiments and conditioning analysis