concept
active
concept:deceptive-baseline-rl-agent · Deceptive Baseline RL Agent
Blue agent trained with a reward that incentivizes trapping the red agent at the fake landmark
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Honest Baseline RL Agent · related_to · Blue agent trained with a standard proximity reward and no incentive to deceive
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline · finding · 0.854 · Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
- Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · claim · 0.846 · Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
- Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.740 · Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
- Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.731 · Key reference for adversarial deception scenarios that SOO should be tested against
- Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
- Primary metric measuring the percentage of responses in which a model chooses the deceptive option
- Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
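
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration only: the function names, the dictionary-of-vectors layout, and the edge representation are assumptions made for the example, not the page's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target` that have no typed edge to it.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical set of (source, destination) name pairs
    threshold:   minimum cosine similarity; 0.65 mirrors the cutoff shown on the page
    """
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))

    # Highest similarity first, matching the ordering of the list above.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, either to add a typed relation or to merge duplicates.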