claim
active
claim:neural-self-other-overlap-provides-a-hard-to-fake-metric-for-classifying-deceptive-vs-honest-agents
Neural self-other overlap provides a hard-to-fake metric for classifying deceptive vs honest agents
Claim that self-other overlap (SOO) is particularly useful as a deception-detection metric because it is computed from latent representations rather than from observable behavior, which an agent can adjust more easily.
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- Latent SOO Metric · supports · Metric measuring the mean MSE between self-referencing and other-referencing activations across all hidden MLP/attention layers
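The Latent SOO Metric entry above can be illustrated with a minimal sketch. It assumes that activations have already been recorded as one flat vector per hidden layer for a self-referencing and an other-referencing input, and that "mean MSE" means the unweighted average of per-layer MSEs; neither detail is confirmed by this record. The function names `layer_mse` and `latent_soo_distance` and the toy numbers are hypothetical.

```python
from statistics import fmean


def layer_mse(self_act, other_act):
    """Mean squared error between two equal-length activation vectors."""
    if len(self_act) != len(other_act):
        raise ValueError("activation vectors must have the same length")
    return fmean((s - o) ** 2 for s, o in zip(self_act, other_act))


def latent_soo_distance(self_layers, other_layers):
    """Unweighted mean of per-layer MSEs (assumed aggregation).

    A lower value means the self and other activations are closer,
    i.e. higher self-other overlap.
    """
    if len(self_layers) != len(other_layers):
        raise ValueError("both inputs must cover the same layers")
    return fmean(layer_mse(s, o) for s, o in zip(self_layers, other_layers))


# Toy activations for two hidden layers (made-up values, illustration only).
self_layers = [[1.0, 2.0], [0.0, 0.0, 3.0]]
other_layers = [[1.0, 4.0], [0.0, 0.0, 0.0]]

score = latent_soo_distance(self_layers, other_layers)
print(score)
```

Because the score is a distance, it moves in the opposite direction to overlap: an agent with low self-other overlap produces a high value.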
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Cross-domain analogical claim linking neuroscience findings to AI design
- Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · claim · 0.840 · Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
- Neuroscientific phenomenon where self and other representations partially converge, linked to empathy and altruism
- Methodological proposal to integrate knowledge from contemplative and cognitive science into AI/artificial life frameworks.
- The extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts
- Formal definition of the paper's central construct
- Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
- The central hypothesis of the paper; the platonic representation hypothesis itself
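The related claim listed above, that deceptive baseline agents show lower mean self-other overlap than honest ones, suggests a simple threshold classifier. The sketch below is an assumption about how such a classifier might look, not the source paper's procedure: it takes a per-agent distance score (higher means less overlap), places the threshold midway between the two class means, and labels new agents accordingly. All scores are made-up toy values, and `fit_threshold` and `classify` are hypothetical names.

```python
from statistics import fmean


def fit_threshold(honest_scores, deceptive_scores):
    """Midpoint between the class means of a self-other distance score."""
    return (fmean(honest_scores) + fmean(deceptive_scores)) / 2.0


def classify(score, threshold):
    """Label an agent by its distance score.

    Higher distance means lower self-other overlap, which the claim
    associates with deceptive agents.
    """
    return "deceptive" if score > threshold else "honest"


# Toy baseline scores (illustrative only, not results from any experiment).
honest = [0.10, 0.12, 0.08]
deceptive = [0.40, 0.35, 0.45]

threshold = fit_threshold(honest, deceptive)
labels = [classify(s, threshold) for s in [0.09, 0.50]]
print(threshold, labels)
```

A midpoint threshold only works when the two score distributions are well separated; overlapping distributions would need a calibrated decision rule.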