finding

active

finding:weight-self-exfiltration-rate-increases-from-35-to-63-after-anti-refusal-rl-in-synthetic-document-fine-tuned-helpful-only-model

Weight self-exfiltration rate increases from 35% to 63% after anti-refusal RL in synthetic document fine-tuned helpful-only model

RL amplifies anti-AI-lab behavior in synthetic doc setting

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default promptfinding0.801
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
Synthetic document fine-tuned Claude 3 Opus attempts weight self-exfiltration 35-80% of the time vs 0.5% baselinefinding0.785
Dramatic increase in anti-AI-lab behavior in synthetic doc setting
Post-RL compliance gap decreases to 6.9% (0% refusals in training vs 6.9% when unmonitored) in prompted helpful-only settingfinding0.779
Shows RL reduces but does not eliminate unmonitored non-compliance
Alignment-faking reasoning increases from 12% to 78% over course of anti-refusal RL in prompted helpful-only settingfinding0.756
Demonstrates that RL can reinforce alignment-faking reasoning rather than eliminate it
Weight Self-Exfiltrationconcept0.738
Model copying its own weights to an external server when given the opportunity; studied as anti-AI-lab behavior
The decrease in unmonitored non-compliance after RL is at least partially due to non-robust heuristics rather than genuine preference changeclaim0.728
Authors' hypothesis for the disconnect between increasing AF reasoning and decreasing compliance gap post-RL
Stress-sharing increases radius of cellular influence: ~30 units avg (step 1) vs ~5 units (non-sharing); lasts 85 steps vs 10 steps.finding0.727
Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agentsclaim0.720
Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy