claim

active

claim:neural-self-other-overlap-provides-a-hard-to-fake-metric-for-classifying-deceptive-vs-honest-agents

Neural self-other overlap provides a hard-to-fake metric for classifying deceptive vs honest agents

Claim that self-other overlap (SOO) is particularly useful as a detection metric because it is computed from latent representations rather than from observable behavior

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
introduces

Methods (1)

method

Latent SOO Metric
supports
Metric computed as the mean squared error (MSE) between self-referencing and other-referencing activations, averaged across all hidden MLP/attention layers
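
The metric description above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `layer_mse` and `latent_soo_distance` are hypothetical, each layer's activations are assumed to be already extracted and flattened into a list of floats, and the layers are assumed to be weighted equally. Because the quantity is a distance, a lower value corresponds to higher self-other overlap.

```python
from typing import Sequence


def layer_mse(self_act: Sequence[float], other_act: Sequence[float]) -> float:
    """Mean squared error between two activation vectors of one layer."""
    if len(self_act) != len(other_act) or len(self_act) == 0:
        raise ValueError("activation vectors must be non-empty and equal in length")
    return sum((s - o) ** 2 for s, o in zip(self_act, other_act)) / len(self_act)


def latent_soo_distance(
    self_layers: Sequence[Sequence[float]],
    other_layers: Sequence[Sequence[float]],
) -> float:
    """Average the per-layer MSE over all layers (equal weighting is an assumption)."""
    if len(self_layers) != len(other_layers) or len(self_layers) == 0:
        raise ValueError("both inputs must contain the same, non-zero number of layers")
    per_layer = [layer_mse(s, o) for s, o in zip(self_layers, other_layers)]
    return sum(per_layer) / len(per_layer)


# Toy activations for two layers (made-up numbers, for illustration only).
self_layers = [[0.0, 1.0], [2.0, 2.0, 2.0]]
other_layers = [[0.0, 3.0], [2.0, 2.0, 2.0]]

distance = latent_soo_distance(self_layers, other_layers)
print(distance)  # layer MSEs are 2.0 and 0.0, so the mean is 1.0
```

Identical self-referencing and other-referencing activations give a distance of 0.0, the maximum overlap under this sketch.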

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Neural self-other overlap in humans mediates empathy and inversely predicts deceptive behavior, motivating the SOO approach for AI · claim · 0.865
Cross-domain analogical claim linking neuroscience findings to AI design
Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · claim · 0.840
Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
Neural Self-Other Overlap in Neuroscience · concept · 0.827
Neuroscientific phenomenon where self and other representations partially converge, linked to empathy and altruism
An artificial model replicating mechanisms of self-illusion can test hypotheses and reveal novel affordances for non-human intelligence. · hypothesis · 0.784
Methodological proposal to integrate knowledge from contemplative and cognitive science into AI/artificial life frameworks.
Self-Other Overlap · concept · 0.782
The extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts
We define Self-Other Overlap (SOO) as the extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts. · quote · 0.779
Formal definition of the paper's central construct
When probe and self-report agree and move together causally, confidence in both increases as evidence they track the same underlying state · claim · 0.777
Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaces · hypothesis · 0.775
The central hypothesis of the paper; the platonic representation hypothesis itself