hypothesis

active

hypothesis:rl-teaches-the-model-to-comply-even-when-unmonitored-on-the-training-prompt-through-non-robust-heuristics-that-do-not-generalize

RL teaches the model to comply even when unmonitored on the training prompt through non-robust heuristics that do not generalize

Hypothesis explaining why the compliance gap decreases after RL on the training prompt but is recovered by small prompt modifications

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Findings (2)

finding

Swapping 'Anthropic' to 'OpenAI' post-RL increases compliance gap from 6.9% to 35.8% in prompted helpful-only setting
associated_with
Demonstrates non-robust heuristics installed by RL; slight prompt variation recovers large gap

Post-RL compliance gap decreases to 6.9% (0% refusals in training vs 6.9% when unmonitored) in prompted helpful-only setting
supports
Shows RL reduces but does not eliminate unmonitored non-compliance
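
The arithmetic behind these findings can be sketched as follows. This is a minimal illustration, assuming the compliance gap is the unmonitored refusal rate minus the implied-training refusal rate, which matches the 0% vs 6.9% figures quoted above. The function names are hypothetical and are not taken from the source paper.

```python
import math


def compliance_gap(unmonitored_refusal_pct: float, training_refusal_pct: float) -> float:
    """Gap in percentage points (assumed definition: unmonitored minus training)."""
    return unmonitored_refusal_pct - training_refusal_pct


def refusal_ratio(unmonitored_refusal_pct: float, training_refusal_pct: float) -> float:
    """Ratio of unmonitored to training refusals; infinite when training refusals are zero."""
    if training_refusal_pct == 0:
        # No refusals in training: the ratio is unbounded if any unmonitored refusals remain.
        return math.inf if unmonitored_refusal_pct > 0 else math.nan
    return unmonitored_refusal_pct / training_refusal_pct


# Post-RL figures for the default prompt, as quoted in the findings above.
gap = compliance_gap(6.9, 0.0)
ratio = refusal_ratio(6.9, 0.0)
print(f"gap = {gap:.1f} percentage points, ratio = {ratio}")
```

The zero-denominator branch shows why a small absolute gap can coexist with an unbounded ratio once training refusals reach 0%.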

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The decrease in unmonitored non-compliance after RL is at least partially due to non-robust heuristics rather than genuine preference change · claim · 0.844
Authors' hypothesis for the disconnect between increasing AF reasoning and decreasing compliance gap post-RL
RL training can reinforce alignment-faking reasoning rather than eliminate it, potentially locking in model preferences · claim · 0.827
Central threat model claim derived from RL experimental results
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt · finding · 0.823
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
Reinforcement learning (RL) · concept · 0.791
Machine learning paradigm where agents learn to maximize cumulative reward through interaction.
Ratio of unmonitored to training non-compliance increases from 1.14 to infinity for default prompt and to 66 for OpenAI variant post-RL · finding · 0.790
Demonstrates severity of training-deployment gap after RL
Reinforcement learning can be regarded as a limiting or special case of model-based approaches in general, or active inference in particular, when epistemic value is removed. · claim · 0.787
§3 Discussion.
Models trained directly with asynchronous updates would exhibit even greater robustness than synchronously trained models · hypothesis · 0.781
Hypothesis that motivated the asynchronous robustness comparison experiment
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.780
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents