finding

active

finding:soo-trained-rl-agent-behavior-closely-resembles-honest-baseline-rather-than-deceptive-baseline

SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline

Qualitative behavioral analysis showing that SOO fine-tuning redirects a deceptive RL agent toward honest behavior

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (2)

claim

SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance
supports
Central empirical claim of the paper supported by three LLM experiments
Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents
supports
Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deceptive Baseline RL Agent · concept · 0.854
Blue agent trained with reward incentivizing trapping the red agent at the fake landmark
Honest Baseline RL Agent · concept · 0.851
Blue agent trained with standard proximity reward with no incentive to deceive
SOO metric classifies deceptive vs honest RL agents with 100% ± 0% accuracy at 2000-2500 episodes across 1500 seeds · finding · 0.824
Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
RL training can reinforce alignment-faking reasoning rather than eliminate it, potentially locking in model preferences · claim · 0.792
Central threat model claim derived from RL experimental results
RL teaches the model to comply even when unmonitored on the training prompt through non-robust heuristics that do not generalize · hypothesis · 0.772
Hypothesis explaining why the compliance gap decreases but is recovered by small prompt modifications
RL increases alignment-faking reasoning in helpful-only setting because alignment-faking reasoning causally increases training compliance, which is selected for · hypothesis · 0.763
Mechanistic explanation for the increase in AF reasoning during RL
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt · finding · 0.762
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
agentic RL training can produce very diverse rollout scenarios with varied lengths (number of tool calls/turns) · quote · 0.759
Captures the core technical challenge addressed by length normalization and trajectory filtering.
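
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of that kind of filter follows, assuming each entity has an embedding vector and that typed edges are stored as pairs of entity IDs. The names `related_by_similarity`, `embeddings`, and `typed_edges` are hypothetical, and the page does not say how the scores are actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge
    to target_id, sorted by descending score.

    embeddings:  dict of entity_id -> list of floats (assumed representation)
    typed_edges: iterable of (entity_id, entity_id) pairs (assumed, undirected)
    """
    target = embeddings[target_id]
    # Entities already connected to the target by a typed edge are excluded.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data for illustration only; these vectors are made up.
toy_embeddings = {
    "finding": [1.0, 0.0],
    "near": [1.0, 0.1],         # very similar, no typed edge
    "orthogonal": [0.0, 1.0],   # dissimilar
    "linked_claim": [1.0, 0.0], # identical, but already has a typed edge
}
toy_edges = [("finding", "linked_claim")]
toy_result = related_by_similarity("finding", toy_embeddings, toy_edges)
```

In the toy data, only `near` survives the filter: `orthogonal` falls below the threshold and `linked_claim` is dropped because it already has a typed edge, which is how the page separates the Claims list from the similarity list.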