claim

active

claim:neural-self-other-overlap-in-humans-mediates-empathy-and-inversely-predicts-deceptive-behavior-motivating-the-soo-approach-for-ai

Neural self-other overlap in humans mediates empathy and inversely predicts deceptive behavior, motivating the SOO approach for AI

Cross-domain analogical claim linking neuroscience findings to AI design

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Self-Other Overlap (SOO) Fine-Tuning
supports
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Neural self-other overlap provides a hard-to-fake metric for classifying deceptive vs honest agents · claim · 0.865
Claim that SOO is particularly useful as a detection metric because it is based on latent representations rather than observable behavior
Neural Self-Other Overlap in Neuroscience · concept · 0.845
Neuroscientific phenomenon where self and other representations partially converge, linked to empathy and altruism
Brethel-Haurwitz et al. 2018 - Extraordinary altruists exhibit enhanced self-other overlap in neural responses to distress · concept · 0.837
Neuroscience finding linking extraordinary altruism to increased self-other overlap in the anterior insula
We define Self-Other Overlap (SOO) as the extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts. · quote · 0.833
Formal definition of the paper's central construct
Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · claim · 0.798
Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
By reducing self-other distinctions during safety training, SOO could make it harder for a model to maintain adversarial or deceptive representations · claim · 0.784
Mechanistic explanation for why SOO reduces deception
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.783
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
all attributions of cognition (i.e., mental actions), including sentience, are always inferred on the basis of embodied behaviours, including verbal self-report in humans. · quote · 0.780
Verbatim statement that attributions of cognition, including sentience, are always inferred from behaviour