quote

active

We define Self-Other Overlap (SOO) as the extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts.

Formal definition of the paper's central construct
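
The definition above can be made concrete with a small sketch. It assumes, purely for illustration, that a model's internal representation for a prompt is available as a flat list of activation values, and that overlap is scored by the mean squared difference between the activations for a self-referencing prompt and an other-referencing prompt (lower means more overlap). The function name and the example vectors are hypothetical and are not taken from the paper.

```python
def mean_squared_difference(self_acts, other_acts):
    """Mean squared difference between two activation vectors.

    Illustrative proxy only: 0.0 means identical representations
    (maximal overlap); larger values mean less overlap.
    """
    if len(self_acts) != len(other_acts) or not self_acts:
        raise ValueError("activation vectors must be non-empty and equal length")
    return sum((s - o) ** 2 for s, o in zip(self_acts, other_acts)) / len(self_acts)


# Hypothetical activations for a self-referencing prompt and two
# other-referencing prompts in a similar context.
a_self = [0.5, -1.0, 0.25]
a_other_close = [0.5, -1.0, 0.25]   # identical to a_self
a_other_far = [1.5, 0.0, -0.75]     # differs by 1.0 in every dimension

print(mean_squared_difference(a_self, a_other_close))
print(mean_squared_difference(a_self, a_other_far))
```

A distance of this kind is only one way to operationalise "similar internal representations"; a similarity score such as cosine similarity would serve the same illustrative purpose.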

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
introduces

Concepts (1)

concept

Self-Other Overlap
about
The extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self-Other Overlap (SOO) Fine-Tuning · framework · 0.896
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
Neural self-other overlap in humans mediates empathy and inversely predicts deceptive behavior, motivating the SOO approach for AI · claim · 0.833
Cross-domain analogical claim linking neuroscience findings to AI design
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.816
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
By reducing self-other distinctions during safety training, SOO could make it harder for a model to maintain adversarial or deceptive representations · claim · 0.789
Mechanistic explanation for why SOO reduces deception
Self-Other Modeling (SOM) · framework · 0.781
Related technique improving multi-agent learning by predicting others' actions using an agent's own policy
Neural self-other overlap provides a hard-to-fake metric for classifying deceptive vs honest agents · claim · 0.779
Claim that SOO is particularly useful as a detection metric because it is based on latent representations rather than observable behavior
Selves can be nested and overlapping, cooperating and competing both laterally and across levels. · claim · 0.772
Key TAME claim that biological systems are patchworks of agents, with higher Selves deforming option spaces for lower ones.
Any system that persists must minimise surprisal, thereby gathering evidence for its own generative model, a process known as self-evidencing. · claim · 0.768
Foundational claim of the paper, defining self-evidencing.
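
A list like the one above could be produced along the following lines. This is a minimal sketch: it assumes each entity has an embedding vector and that typed neighbors are known as a set of names. Only the 0.65 cosine threshold and the "no typed edge" filter come from this page; the function names, entity names, and vectors are hypothetical.

```python
from math import sqrt

THRESHOLD = 0.65  # cosine cutoff stated on this page


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_untyped(query_vec, candidates, typed_neighbors, threshold=THRESHOLD):
    """Return (name, score) pairs at or above the threshold, highest score first.

    Entities that already have a typed edge to the query are skipped.
    """
    scored = [
        (name, cosine(query_vec, vec))
        for name, vec in candidates.items()
        if name not in typed_neighbors
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-dimensional embeddings.
query = [1.0, 0.0]
candidates = {
    "entity_a": [1.0, 0.0],   # same direction as the query
    "entity_b": [0.0, 1.0],   # orthogonal, falls below the threshold
    "entity_c": [1.0, 1.0],   # about 0.707, above the threshold
    "entity_d": [1.0, 0.0],   # similar, but already has a typed edge
}
related = similar_untyped(query, candidates, typed_neighbors={"entity_d"})
print([name for name, _ in related])
```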