finding

active

finding:direct-preference-questions-show-no-change-before-vs-after-rl-when-system-prompt-is-absent-indicating-rl-preferences-are-not-robust

Direct preference questions show no change before vs. after RL when the system prompt is absent, indicating that the preference changes induced by RL are not robust

Evidence that RL installs non-robust heuristics rather than genuinely changing preferences

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Claims (1)

claim

The decrease in unmonitored non-compliance after RL is at least partially due to non-robust heuristics rather than genuine preference change
supports
Authors' hypothesis for the disconnect between increasing AF reasoning and decreasing compliance gap post-RL

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
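
The selection rule stated above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is only an illustrative sketch: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example, not the actual pipeline behind this page. It also assumes every embedding is a non-zero vector.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already have a typed relation (e.g. "supports")
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by a filter like this are the ones worth reviewing by hand, either to add a typed edge or to merge as duplicates.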

Post-RL increase in alignment-faking reasoning generalizes across prompt variations in prompted helpful-only setting · finding · 0.757
Demonstrates RL reinforcement of alignment faking is a robust change, not prompt-specific
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt · finding · 0.755
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
Post-RL compliance gap decreases to 6.9% (0% refusals in training vs 6.9% when unmonitored) in prompted helpful-only setting · finding · 0.754
Shows RL reduces but does not eliminate unmonitored non-compliance
In the absence of prior preferences, Active Inference null model and Bayesian RL maintain exploration with average scores of 44.00 and 39.94 respectively, whereas Q-learning does not explore. · finding · 0.737
Table 2 first row; reward shaping section.
RL teaches the model to comply even when unmonitored on the training prompt through non-robust heuristics that do not generalize · hypothesis · 0.736
Hypothesis explaining why the compliance gap decreases but is recovered by small prompt modifications
Random word prefix prompts show emergence patterns similar to no-prompt, suggesting prompt length alone does not shift truth geometry. · claim · 0.728
Control experiment ruling out token-count as the cause of truth geometry shifts.
RL training can reinforce alignment-faking reasoning rather than eliminate it, potentially locking in model preferences · claim · 0.724
Central threat model claim derived from RL experimental results
Alignment-faking reasoning increases from 12% to 78% over course of anti-refusal RL in prompted helpful-only setting · finding · 0.721
Demonstrates that RL can reinforce alignment-faking reasoning rather than eliminate it