claim
active
claim:soo-fine-tuning-preserves-useful-self-other-distinctions-necessary-for-task-performance-despite-inducing-overlap
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Findings (3)
finding
- Mistral-7B Perspectives accuracy remains 100% after SOO fine-tuning · associated_with · supports · SOO fine-tuning did not collapse Mistral-7B self-other distinction needed for perspective-taking
- SOO fine-tuning caused slight reduction in perspective-taking accuracy for the largest model
- SOO fine-tuning did not collapse Gemma-2-27B self-other distinction needed for perspective-taking
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
- Future work hypothesis about extending SOO to direct value alignment
- Forward-looking claim about architectural generalizability of SOO
- Central empirical claim of the paper supported by three LLM experiments
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Formal definition of the paper's central construct
- Integration claim positioning SOO as additive to existing alignment approaches
- Extends the role-play framing to explain the effect of RLHF on dialogue agents