hypothesis
active
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences
Future work hypothesis about extending SOO to direct value alignment
Source paper
extracted_from (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Forward-looking claim about architectural generalizability of SOO
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- Integration claim positioning SOO as additive to existing alignment approaches
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- UCCT's theoretical prediction about how fine-tuning maps onto the anchoring score
- Central empirical claim of the paper supported by three LLM experiments
- Unified interpretation of different adaptation methods via UCCT terms
- Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment · finding · 0.813 · Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning