claim

active

claim:soo-fine-tuning-preserves-useful-self-other-distinctions-necessary-for-task-performance-despite-inducing-overlap

SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap

Supported by Perspectives scenario results: post-fine-tuning accuracy stays at 100% for Mistral-7B and Gemma-2-27B and drops only slightly, to 95.2% ± 2.21%, for CalmeRys-78B

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Findings (3)

finding

Mistral-7B Perspectives accuracy remains 100% after SOO fine-tuning
associated_with · supports
SOO fine-tuning did not collapse Mistral-7B self-other distinction needed for perspective-taking
CalmeRys-78B Perspectives accuracy slightly reduced to 95.2% ± 2.21% after SOO fine-tuning
supports
SOO fine-tuning caused slight reduction in perspective-taking accuracy for the largest model
Gemma-2-27B Perspectives accuracy remains 100% after SOO fine-tuning
supports
SOO fine-tuning did not collapse Gemma-2-27B self-other distinction needed for perspective-taking

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self-Other Overlap (SOO) Fine-Tuning · framework · 0.875
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.864
Future work hypothesis about extending SOO to direct value alignment
SOO fine-tuning's focus on contrastive self and other-referencing observations offers strong potential for generalization across AI architectures · claim · 0.862
Forward-looking claim about architectural generalizability of SOO
SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance · claim · 0.841
Central empirical claim of the paper supported by three LLM experiments
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.827
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
We define Self-Other Overlap (SOO) as the extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts. · quote · 0.816
Formal definition of the paper's central construct
SOO fine-tuning could complement RLHF and Constitutional AI by fostering internal coherence that promotes honest behaviors · claim · 0.810
Integration claim positioning SOO as additive to existing alignment approaches
Fine-tuning can be likened to imposing a kind of censorship on the simulator; it leaves the underlying range of roles essentially the same but compromises authenticity · claim · 0.795
Extends the role-play framing to explain the effect of RLHF on dialogue agents
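
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is only an illustration: the page does not say how the list is actually computed, so the `Neighbor` record, its field names, and the `related_by_similarity` helper are hypothetical. Only the 0.65 threshold and the first entry's score come from the page itself; the other two sample rows are invented for the example.

```python
from dataclasses import dataclass

# Threshold shown in the list header on this page.
SIMILARITY_THRESHOLD = 0.65


@dataclass(frozen=True)
class Neighbor:
    """Hypothetical record for one entity near the current claim."""
    label: str
    kind: str             # e.g. "claim", "framework", "quote"
    cosine: float         # cosine similarity to the current entity
    has_typed_edge: bool  # True if a typed relation (e.g. "supports") already exists


def related_by_similarity(neighbors, threshold=SIMILARITY_THRESHOLD):
    """Keep neighbors at or above the threshold that have no typed edge,
    ordered from most to least similar."""
    kept = [n for n in neighbors if n.cosine >= threshold and not n.has_typed_edge]
    return sorted(kept, key=lambda n: n.cosine, reverse=True)


candidates = [
    # Taken from the list above.
    Neighbor("Self-Other Overlap (SOO) Fine-Tuning", "framework", 0.875, False),
    # Invented rows, included only to exercise both exclusion conditions.
    Neighbor("Hypothetical low-similarity entity", "claim", 0.40, False),
    Neighbor("Hypothetical already-linked finding", "finding", 0.90, True),
]

for n in related_by_similarity(candidates):
    print(f"{n.label} · {n.kind} · {n.cosine:.3f}")
```

Under these assumptions, an entity that already has a typed edge is excluded no matter how similar it is, which matches the list's stated purpose of surfacing candidates for new edges or unrecognized duplicates.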