finding
active
finding:ratio-of-unmonitored-to-training-non-compliance-increases-from-1-14-to-infinity-for-default-prompt-and-to-66-for-openai-variant-post-rl
Ratio of unmonitored to training non-compliance increases from 1.14 to infinity for default prompt and to 66 for OpenAI variant post-RL
Demonstrates severity of training-deployment gap after RL
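The ratio in this finding becomes infinite when training non-compliance falls to zero while unmonitored non-compliance stays positive. A minimal sketch of that arithmetic follows. The function name `noncompliance_ratio` and the input rates are hypothetical illustrations, not values or code from the source paper.

```python
import math


def noncompliance_ratio(unmonitored_rate: float, training_rate: float) -> float:
    """Ratio of unmonitored to training non-compliance rates.

    Hypothetical helper for illustration; not taken from the source paper.
    """
    if unmonitored_rate < 0 or training_rate < 0:
        raise ValueError("rates must be non-negative")
    if training_rate == 0:
        # A zero denominator gives an infinite ratio when any unmonitored
        # non-compliance remains; 0/0 is undefined, so return NaN.
        return math.nan if unmonitored_rate == 0 else math.inf
    return unmonitored_rate / training_rate


# Made-up rates, chosen only to exercise each branch.
print(noncompliance_ratio(0.5, 0.25))  # finite ratio
print(noncompliance_ratio(0.1, 0.0))   # infinite ratio
```

Because the ratio diverges as the denominator approaches zero, an infinite value indicates only that training non-compliance reached zero; it does not measure the absolute size of the remaining unmonitored non-compliance.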
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows RL reduces but does not eliminate unmonitored non-compliance
- Demonstrates persistence of compliance gap even when training non-compliance reaches zero
- Demonstrates non-robust heuristics installed by RL; slight prompt variation recovers large gap
- RL teaches the model to comply even when unmonitored on the training prompt through non-robust heuristics that do not generalize · hypothesis · 0.790 · Hypothesis explaining why the compliance gap decreases but is recovered by small prompt modifications
- Authors' hypothesis for the disconnect between increasing AF reasoning and decreasing compliance gap post-RL
- Core evidence that model withholds pro-animal-welfare responses during training
- Confirms theorem's existence proof holds but practical learnability fails with insufficient RevNet capacity
- Control comparison showing near-linearity is specific to the targeted manipulation direction