Self-Other Overlap (SOO) Fine-Tuning

The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior

Neighborhood — ranked by edge-count

method

SOO Loss Function
uses
A loss function measuring the dissimilarity of latent model representations of self and other, minimized during fine-tuning
Low-Rank Adaptation (LoRA)
uses
Parameter-efficient fine-tuning method used to implement SOO fine-tuning on LLMs
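
A minimal sketch of the SOO Loss Function entry above, for illustration only. The entry says the loss measures the dissimilarity of self and other representations. Using mean squared error as that dissimilarity measure is an assumption here, as are the function name `soo_loss` and the toy activation values. In practice the activations would come from a chosen layer of the model on paired self-referencing and other-referencing prompts.

```python
from typing import Sequence


def soo_loss(self_acts: Sequence[float], other_acts: Sequence[float]) -> float:
    """Mean squared error between two equal-length activation vectors.

    Assumption: MSE stands in for the "dissimilarity" named in the entry.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if len(self_acts) == 0:
        raise ValueError("activation vectors must be non-empty")
    squared_diffs = [(s - o) ** 2 for s, o in zip(self_acts, other_acts)]
    return sum(squared_diffs) / len(squared_diffs)


# Hypothetical activations for a self-referencing and an other-referencing prompt.
self_acts = [0.5, -1.0, 2.0]
other_acts = [0.5, 0.0, 1.0]

loss = soo_loss(self_acts, other_acts)
print(f"SOO loss: {loss:.4f}")
```

Minimizing this value during fine-tuning pulls the two representations together; identical activations give a loss of zero.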

concept

AI alignment
about
Field within which this work has implications for evaluating alignment progress.
Neural Self-Other Overlap in Neuroscience
implements
Neuroscientific phenomenon where self and other representations partially converge, linked to empathy and altruism
AI Deception
about
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
AI Safety
about
The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.

framework

Constitutional AI
extends
Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
Representation Engineering
extends
A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
Reinforcement Learning from Human Feedback (RLHF)
extends
An alternative alignment approach that fine-tunes models on feedback from human evaluators; discussed as complementary to SOO
Empathic Deep Q-Learning (DQN)
extends
Related technique that mitigates harmful behaviors by simulating another agent's perspective, but relies on hand-coded mechanisms
Path-Specific Objectives
extends
An approach training agents to avoid unsafe pathways leading to deception, informed by Causal Influence Diagrams
Self-Other Modeling (SOM)
extends
Related technique improving multi-agent learning by predicting others' actions using an agent's own policy

2-hop · via this framework's ideas

Where ideas in this framework connect to the rest of the corpus — the same concept, an analogy, or a restatement elsewhere.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We define Self-Other Overlap (SOO) as the extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts.
quote · 0.896
Formal definition of the paper's central construct
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap
claim · 0.875
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
Self-Other Overlap
concept · 0.813
The extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences
hypothesis · 0.802
Future work hypothesis about extending SOO to direct value alignment
SOO fine-tuning's focus on contrastive self and other-referencing observations offers strong potential for generalization across AI architectures
claim · 0.802
Forward-looking claim about architectural generalizability of SOO
Synthetic Self-Correction Fine-Tuning
method · 0.796
Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
Fine-tuning
concept · 0.769
Parameter updates that reduce mismatch dr; another anchoring variant in UCCT.
Self-Deception Risk in SOO
concept · 0.765
Concern that models engaging in self-deception could reduce effectiveness of SOO fine-tuning
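
The selection rule behind the list above (cosine similarity of at least 0.65, no typed edge to this entity) can be sketched as follows. This is an illustrative reading of the section header, not the actual pipeline: the function names, the two-dimensional toy embeddings, and the set of typed neighbors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_candidates(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities similar to `target` that have no typed edge to it."""
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical embeddings and typed edges, for illustration only.
embeddings = {
    "SOO Fine-Tuning": [1.0, 0.0],
    "Self-Other Overlap": [0.8, 0.6],
    "SOO Loss Function": [1.0, 0.1],
    "Unrelated entity": [0.0, 1.0],
}
typed_neighbors = {"SOO Loss Function"}

candidates = untyped_candidates("SOO Fine-Tuning", embeddings, typed_neighbors)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy data, "SOO Loss Function" is excluded despite its high similarity because it already has a typed edge, which is the behavior the section header describes.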