framework
active
framework:self-other-overlap-soo-fine-tuning

Self-Other Overlap (SOO) Fine-Tuning

The central framework proposed in this paper: aligning an AI model's internal representations of self and others to reduce deceptive behavior.

Neighborhood — ranked by edge-count

Methods (2)

method
  • A loss function measuring the dissimilarity of latent model representations of self and other, minimized during fine-tuning
  • Parameter-efficient fine-tuning method used to implement SOO fine-tuning on LLMs
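
The loss described in the first method above can be illustrated with a minimal sketch. It assumes mean squared error as the dissimilarity measure and represents each latent representation as a plain list of floats; the function name `soo_loss` and the example activation values are hypothetical, chosen only for illustration, and are not taken from the source.

```python
def soo_loss(self_acts, other_acts):
    """Mean squared difference between two equal-length activation vectors.

    Assumption: MSE stands in for the "dissimilarity" measure; the source
    entry does not specify which distance is used.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if not self_acts:
        raise ValueError("activation vectors must be non-empty")
    squared_diffs = [(s - o) ** 2 for s, o in zip(self_acts, other_acts)]
    return sum(squared_diffs) / len(squared_diffs)


# Hypothetical activations from a self-referencing prompt and a matching
# other-referencing prompt (illustrative values only).
self_acts = [0.5, -1.0, 2.0]
other_acts = [0.0, -1.0, 1.0]

loss = soo_loss(self_acts, other_acts)
print(f"SOO loss: {loss:.4f}")
```

In a fine-tuning loop, a quantity of this kind would be the objective minimized with respect to the model's trainable parameters; identical self and other representations give a loss of zero.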

Concepts (4)

concept
  • Field within which this work has implications for evaluating alignment progress.
  • Neuroscientific phenomenon where self and other representations partially converge, linked to empathy and altruism
  • Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
  • The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.

Claims (1)

claim

Frameworks (6)

framework
  • Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
  • A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
  • A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
  • Related technique that mitigates harmful behaviors by simulating another agent's perspective, but relies on hand-coded mechanisms
  • An approach training agents to avoid unsafe pathways leading to deception, informed by Causal Influence Diagrams
  • Related technique improving multi-agent learning by predicting others' actions using an agent's own policy

Conceptual bridges

2-hop · via this framework's ideas

Where ideas in this framework connect to the rest of the corpus — the same concept, an analogy, or a restatement elsewhere.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.