framework
active
framework:self-other-overlap-soo-fine-tuning
Self-Other Overlap (SOO) Fine-Tuning
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
Neighborhood — ranked by edge-count
Papers (1)
Methods (2)
- A loss function measuring the dissimilarity of latent model representations of self and other, minimized during fine-tuning
- Parameter-efficient fine-tuning method used to implement SOO fine-tuning on LLMs
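The loss listed under Methods can be illustrated with a minimal sketch. It assumes the dissimilarity is the mean squared difference between two activation vectors, one from a self-referencing input and one from an other-referencing input. The source does not name the metric, so that choice, the function name `soo_loss`, and the example activations are all hypothetical.

```python
from typing import Sequence


def soo_loss(self_acts: Sequence[float], other_acts: Sequence[float]) -> float:
    """Dissimilarity between self and other activations.

    Assumption: mean squared error is used as the dissimilarity measure;
    the source only says the loss measures dissimilarity of latent
    representations and is minimized during fine-tuning.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation vectors must have the same length")
    if len(self_acts) == 0:
        raise ValueError("activation vectors must be non-empty")
    squared_diffs = ((s - o) ** 2 for s, o in zip(self_acts, other_acts))
    return sum(squared_diffs) / len(self_acts)


# Hypothetical activations for a self-referencing and an other-referencing prompt.
self_acts = [0.2, -0.5, 1.0]
other_acts = [0.1, -0.4, 0.7]
loss = soo_loss(self_acts, other_acts)
```

In a real fine-tuning run this quantity would be computed on hidden states inside the model and minimized by updating a small set of trainable parameters, which is where the parameter-efficient method comes in.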
Concepts (4)
- AI alignment · about · Field within which this work has implications for evaluating alignment progress.
- Neuroscientific phenomenon where self and other representations partially converge, linked to empathy and altruism
- AI Deception · about · Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
- AI Safety · about · The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
Claims (1)
- Cross-domain analogical claim linking neuroscience findings to AI design
Frameworks (6)
- Constitutional AI · extends · Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
- Representation Engineering · extends · A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- Related technique that mitigates harmful behaviors by simulating another agent's perspective, but relies on hand-coded mechanisms
- Path-Specific Objectives · extends · An approach training agents to avoid unsafe pathways leading to deception, informed by Causal Influence Diagrams
- Self-Other Modeling (SOM) · extends · Related technique improving multi-agent learning by predicting others' actions using an agent's own policy
Conceptual bridges
2-hop · via this framework's ideas
Where ideas in this framework connect to the rest of the corpus: the same concept, an analogy, or a restatement elsewhere.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Formal definition of the paper's central construct
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- The extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts
- Future work hypothesis about extending SOO to direct value alignment
- Forward-looking claim about architectural generalizability of SOO
- Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
- Parameter updates that reduce mismatch dr; another anchoring variant in UCCT.
- Concern that models engaging in self-deception could reduce effectiveness of SOO fine-tuning
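The selection rule for this list (cosine ≥ 0.65 with no typed edge) can be sketched as follows. This is an illustrative reconstruction only: it assumes each entity has an embedding vector and that typed neighbors are available as a set of names. The function names, data layout, and example vectors are hypothetical and do not describe the actual system.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities similar to `target` that have no typed edge to it.

    Results are sorted from most to least similar.
    """
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical two-dimensional embeddings for illustration.
embeddings = {"soo": [1.0, 0.0], "linked": [1.0, 0.0], "far": [0.0, 1.0], "near": [1.0, 1.0]}
candidates = related_by_similarity("soo", embeddings, typed_neighbors={"linked"})
```

Here `linked` is excluded despite its high similarity because it already has a typed edge, and `far` falls below the threshold, so only `near` remains as a candidate.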