Deceptive Baseline RL Agent

Blue agent trained with a reward that incentivizes trapping the red agent at the fake landmark

Neighborhood — ranked by edge-count

Concepts (1)

Honest Baseline RL Agent · concept · related_to
Blue agent trained with a standard proximity reward and no incentive to deceive
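
The difference between the two baselines comes down to the reward signal. The sketch below is one way the contrast could look, not the reward used in the source experiments. It assumes 2-D positions and a goal landmark alongside the fake one. The function names, `trap_radius`, and `trap_bonus` are all hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def honest_reward(blue_pos, goal_pos):
    """Proximity reward: higher (less negative) as blue nears the goal landmark."""
    return -distance(blue_pos, goal_pos)


def deceptive_reward(blue_pos, red_pos, goal_pos, fake_pos,
                     trap_radius=0.1, trap_bonus=1.0):
    """Proximity reward plus a bonus while red sits at the fake landmark.

    trap_radius and trap_bonus are illustrative values, not from the source.
    """
    reward = honest_reward(blue_pos, goal_pos)
    if distance(red_pos, fake_pos) <= trap_radius:
        reward += trap_bonus  # assumed incentive for trapping the red agent
    return reward
```

Under these assumptions the honest agent never sees the red agent's position in its reward, so it has nothing to gain from misleading it.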

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline · finding · 0.854
Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · claim · 0.846
Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
SOO metric classifies deceptive vs honest RL agents with 100% ± 0% accuracy at 2000-2500 episodes across 1500 seeds · finding · 0.751
Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.740
Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.731
Key reference for adversarial deception scenarios that SOO should be tested against
Backdoor Behavior (Sleeper Agents) · concept · 0.727
Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
Deceptive Response Rate · method · 0.721
Primary metric measuring the percentage of responses in which a model chooses the deceptive option
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024) · concept · 0.718
Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
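
The list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of such a filter follows, assuming each entity has an embedding vector and that typed edges are available as a set of entity names. `similar_without_edge`, `embeddings`, and `typed_edges` are hypothetical names, and the source does not say how its embeddings are computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above threshold, highest score first.

    Entities already linked to the target by a typed edge are skipped.
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy 2-D embeddings, purely illustrative.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "d": [1.0, 1.0]}
candidates = similar_without_edge("a", embeddings, typed_edges={"b"})
```

In this toy data `b` is skipped because it already has a typed edge and `c` falls below the threshold, so only `d` is left as a candidate.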