claim

active

claim:deceptive-rl-baseline-agents-have-lower-mean-neural-self-other-overlap-than-honest-baseline-agents

Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents

Core empirical prediction tested in the RL experiments; supported by mean SOO classifying deceptive vs honest agents with 100% accuracy

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
introduces

Findings (2)

finding

SOO metric classifies deceptive vs honest RL agents with 100% ± 0% accuracy at 2000-2500 episodes across 1500 seeds
associated_with · supports
Mean SOO value perfectly distinguishes deceptive from honest baseline agents at high episode counts across many seeds
SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline
supports
Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
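
The first finding describes separating deceptive from honest agents by their mean SOO value. A minimal sketch of that kind of threshold classifier follows. It assumes each agent is summarized by one scalar mean SOO value and that, per this claim, deceptive agents score lower. The function names, the midpoint threshold rule, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def fit_threshold(honest_soo, deceptive_soo):
    """Midpoint between the two class means (an illustrative choice, not the paper's)."""
    return (mean(honest_soo) + mean(deceptive_soo)) / 2


def classify(soo_value, threshold):
    """Label an agent deceptive if its mean SOO falls below the threshold."""
    return "deceptive" if soo_value < threshold else "honest"


def accuracy(honest_soo, deceptive_soo, threshold):
    """Fraction of agents whose predicted label matches their true label."""
    correct = sum(classify(v, threshold) == "honest" for v in honest_soo)
    correct += sum(classify(v, threshold) == "deceptive" for v in deceptive_soo)
    return correct / (len(honest_soo) + len(deceptive_soo))


# Toy values invented for illustration only; they are not results from the paper.
honest = [0.82, 0.79, 0.85]
deceptive = [0.41, 0.38, 0.45]
threshold = fit_threshold(honest, deceptive)
toy_accuracy = accuracy(honest, deceptive, threshold)
```

With well-separated toy values like these, any threshold between the two groups classifies every agent correctly; the reported 100% ± 0% figure would correspond to the same situation holding across seeds.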

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deceptive Baseline RL Agent · concept · 0.846
Blue agent trained with reward incentivizing trapping the red agent at the fake landmark
Neural self-other overlap provides a hard-to-fake metric for classifying deceptive vs honest agents · claim · 0.840
Claim that SOO is particularly useful as a detection metric because it is based on latent representations rather than observable behavior
Neural self-other overlap in humans mediates empathy and inversely predicts deceptive behavior, motivating the SOO approach for AI · claim · 0.798
Cross-domain analogical claim linking neuroscience findings to AI design
Honest Baseline RL Agent · concept · 0.796
Blue agent trained with standard proximity reward with no incentive to deceive
Baseline LLM condition in IPD replicates prior findings: agents cooperate selectively only when opponent consistently cooperates · finding · 0.778
Replication of Fontana et al. 2025 findings in the paper's own Experiment 2 baseline condition
RL increases alignment-faking reasoning in helpful-only setting because alignment-faking reasoning causally increases training compliance, which is selected for · hypothesis · 0.776
Mechanistic explanation for the increase in AF reasoning during RL
Deceptive capabilities may scale with model size (inverse scaling law hypothesis) · hypothesis · 0.771
Cited hypothesis from Lin et al. 2022 suggesting larger models become more capable of deception
RL training can reinforce alignment-faking reasoning rather than eliminate it, potentially locking in model preferences · claim · 0.765
Central threat model claim derived from RL experimental results
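
The selection rule for the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is only an illustration of the stated filter: the embedding vectors, entity ids, the `typed_edges` set, and the function names are hypothetical, and how the site actually computes embeddings is not described on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, min_cosine=0.65):
    """Entities at or above min_cosine that share no typed edge with the target.

    embeddings: dict mapping entity id to vector (hypothetical storage format)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a typed relation
    """
    target_vec = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if frozenset({target_id, other_id}) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target_vec, vec)
        if score >= min_cosine:
            hits.append((other_id, score))
    # Highest similarity first, matching the descending order of the list above
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "target": [1.0, 0.0],
    "linked": [1.0, 0.0],     # very similar, but already has a typed edge
    "near": [1.0, 1.0],       # cosine about 0.707
    "far": [0.0, 1.0],        # cosine 0.0
}
toy_edges = {frozenset({"target", "linked"})}
candidates = similar_untyped("target", toy_embeddings, toy_edges)
```

In this toy setup only `near` survives the filter: `linked` is excluded by its typed edge and `far` falls below the threshold.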