claim
active
claim:neural-self-other-overlap-provides-a-hard-to-fake-metric-for-classifying-deceptive-vs-honest-agents

Neural self-other overlap provides a hard-to-fake metric for classifying deceptive vs honest agents

Claims that self-other overlap (SOO) is particularly useful as a detection metric because it is computed from latent representations rather than from observable behavior.

Source paper

extracted_from
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Methods (1)

method
  • Metric measuring the mean MSE between self-referencing and other-referencing activations across all hidden MLP/attention layers
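The metric described above can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes the activations have already been collected as one array per hidden layer for a self-referencing input and a matched other-referencing input. The function name `soo_mse` and the choice to average per-layer MSEs with equal weight are assumptions made for this example.

```python
import numpy as np


def soo_mse(self_acts, other_acts):
    """Mean over layers of the MSE between paired activations.

    self_acts, other_acts: sequences of arrays, one per hidden layer
    (hypothetical inputs; how they are extracted is not shown here).
    Arrays at the same index must have the same shape.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("need the same number of layers for both inputs")
    if len(self_acts) == 0:
        raise ValueError("need at least one layer")

    per_layer = []
    for s, o in zip(self_acts, other_acts):
        s = np.asarray(s, dtype=float)
        o = np.asarray(o, dtype=float)
        if s.shape != o.shape:
            raise ValueError("layer shapes differ: %s vs %s" % (s.shape, o.shape))
        # Mean squared error for this layer.
        per_layer.append(np.mean((s - o) ** 2))

    # Equal-weight average across layers (an assumption of this sketch).
    return float(np.mean(per_layer))


# Toy example with two layers of made-up activations.
self_layers = [np.zeros(4), np.ones(2)]
other_layers = [np.ones(4), np.ones(2)]
score = soo_mse(self_layers, other_layers)  # per-layer MSEs are 1.0 and 0.0
```

A lower value means the two sets of activations are closer together, that is, more overlap. Any threshold used to separate deceptive from honest agents would have to be chosen empirically and is not specified here.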

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.