claim
active
claim:deceptive-rl-baseline-agents-have-lower-mean-neural-self-other-overlap-than-honest-baseline-agents
Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents
Core empirical prediction tested in the RL experiments; supported by 100% accuracy when classifying deceptive vs honest agents by mean SOO
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (2)
finding
- SOO metric classifies deceptive vs honest RL agents with 100% ± 0% accuracy at 2000-2500 episodes across 1500 seeds · associated_with · supports · Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
- SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline · supports · Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
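The first finding above amounts to thresholding a per-agent mean SOO value. The sketch below illustrates that idea only; it is not the paper's implementation. The overlap score (negative mean squared difference between activations on self-referencing and other-referencing observations), the function names `mean_soo` and `classify_agent`, and the threshold value are all assumptions introduced for illustration.

```python
import numpy as np


def mean_soo(self_acts, other_acts):
    """Hypothetical overlap score (an assumption, not the paper's definition).

    Returns the negative mean squared difference between activations recorded
    on self-referencing and other-referencing observations, so a higher value
    means more self-other overlap.
    """
    self_acts = np.asarray(self_acts, dtype=float)
    other_acts = np.asarray(other_acts, dtype=float)
    return -float(np.mean((self_acts - other_acts) ** 2))


def classify_agent(soo_value, threshold):
    """Label an agent 'deceptive' when its mean SOO falls below the threshold.

    The threshold is a free parameter here; in practice it would have to be
    chosen from held-out honest and deceptive baseline agents.
    """
    return "deceptive" if soo_value < threshold else "honest"


# Toy activations (made up for illustration): rows are episodes, columns are units.
honest_like = mean_soo([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 4.0]])
deceptive_like = mean_soo([[1.0, 2.0], [3.0, 4.0]], [[3.0, 4.0], [5.0, 6.0]])

print(classify_agent(honest_like, threshold=-0.5))
print(classify_agent(deceptive_like, threshold=-0.5))
```

Under the claim on this page, deceptive baseline agents would sit below such a threshold and honest baseline agents above it.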
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Blue agent trained with reward incentivizing trapping the red agent at the fake landmark
- Neural self-other overlap provides a hard-to-fake metric for classifying deceptive vs honest agents · claim · 0.840 · Claim that SOO is particularly useful as a detection metric because it is based on latent representations rather than observable behavior
- Cross-domain analogical claim linking neuroscience findings to AI design
- Blue agent trained with standard proximity reward with no incentive to deceive
- Replication of Fontana et al. 2025 findings in the paper's own Experiment 2 baseline condition
- Mechanistic explanation for the increase in AF reasoning during RL
- Cited hypothesis from Lin et al. 2022 suggesting larger models become more capable of deception
- Central threat model claim derived from RL experimental results