question
active
question:to-what-extent-does-self-deception-in-ai-models-affect-the-effectiveness-of-soo-fine-tuning
To what extent does self-deception in AI models affect the effectiveness of SOO fine-tuning?
Open concern about whether models can learn to self-deceive in ways that undermine SOO
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- Central empirical claim of the paper supported by three LLM experiments
Concepts (1)
concept
- Self-Deception Risk in SOO · associated_with · Concern that models engaging in self-deception could reduce effectiveness of SOO fine-tuning
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Open research question identified as warranting further investigation
- Future work hypothesis about extending SOO to direct value alignment
- Forward-looking claim about architectural generalizability of SOO
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- Methodological proposal to integrate knowledge from contemplative and cognitive science into AI/artificial life frameworks.
- Normative-scientific claim about the alignment implications of Experiment 2's findings
- SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.766 · Future work hypothesis about testing SOO against adversarial sleeper agent scenarios