finding
active
finding:soo-trained-rl-agent-behavior-closely-resembles-honest-baseline-rather-than-deceptive-baseline
SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline
Qualitative behavioral analysis showing that SOO fine-tuning redirects a deceptive RL agent toward honest behavior
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (2)
claim
- Central empirical claim of the paper supported by three LLM experiments
- Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · supports · Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Blue agent trained with reward incentivizing trapping the red agent at the fake landmark
- Blue agent trained with standard proximity reward with no incentive to deceive
- Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
- Central threat model claim derived from RL experimental results
- RL teaches the model to comply even when unmonitored on the training prompt through non-robust heuristics that do not generalize · hypothesis · 0.772 · Hypothesis explaining why the compliance gap decreases but is recovered by small prompt modifications
- Mechanistic explanation for the increase in AF reasoning during RL
- Demonstrates persistence of compliance gap even when training non-compliance reaches zero
- Captures the core technical challenge addressed by length normalization and trajectory filtering