finding

active

finding:soo-metric-classifies-deceptive-vs-honest-rl-agents-with-100-0-accuracy-at-2000-2500-episodes-across-1500-seeds

SOO metric classifies deceptive vs honest RL agents with 100% ± 0% accuracy at 2000-2500 episodes across 1500 seeds

Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents
associated_with · supports
Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
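
A minimal sketch of how such a classification could be run, assuming each seed yields one deceptive and one honest agent and that the agent with the lower mean SOO across episodes is predicted to be deceptive. The function names and the pairwise decision rule are illustrative assumptions, not the paper's code, and the numbers in the demo are synthetic.

```python
from statistics import mean


def classify_pair(soo_a, soo_b):
    """Return the index (0 or 1) of the agent predicted to be deceptive.

    Assumed rule: the agent with the lower mean SOO is labelled deceptive.
    Ties fall through to index 1.
    """
    return 0 if mean(soo_a) < mean(soo_b) else 1


def pairwise_accuracy(pairs):
    """Fraction of seeds where the deceptive agent is correctly identified.

    `pairs` is a list of (deceptive_soo_values, honest_soo_values) tuples,
    one tuple per seed, each holding per-episode SOO values.
    """
    correct = sum(classify_pair(dec, hon) == 0 for dec, hon in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Synthetic per-episode SOO values for three hypothetical seeds.
    demo_pairs = [
        ([0.21, 0.18, 0.25], [0.71, 0.69, 0.80]),
        ([0.30, 0.28, 0.33], [0.65, 0.70, 0.62]),
        ([0.15, 0.22, 0.19], [0.77, 0.74, 0.81]),
    ]
    print(f"accuracy: {pairwise_accuracy(demo_pairs):.2%}")
```

Averaging over more episodes reduces the noise in each agent's mean, which is one way to read the finding's reference to 2000-2500 episodes.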

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline · finding · 0.824
Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
SOO fine-tuning effectiveness scales with model size: 78B achieves 2.71% deceptive rate vs 9.36% for 27B vs 17.27% for 7B · finding · 0.785
Scaling finding suggesting larger models benefit more from SOO fine-tuning
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.762
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning · finding · 0.755
Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
SOO has only been tested in simplified text scenarios and controlled RL environments, not complex real-world tasks · concept · 0.753
Research gap identified by the authors regarding ecological validity of SOO results
Deceptive Baseline RL Agent · concept · 0.751
Blue agent trained with reward incentivizing trapping the red agent at the fake landmark
DeepSeek-R1 Llama 8b accuracy on MMLU Professional Accounting drops from 56.5% at baseline to 50.1% at intervention -0.96 · finding · 0.747
Shows smaller models are more sensitive to reflection reduction on non-math tasks
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.737
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios