concept
active
concept:self-deception-risk-in-soo
Self-Deception Risk in SOO
Concern that models engaging in self-deception could reduce the effectiveness of SOO fine-tuning
Neighborhood — ranked by edge-count
Questions (1)
question
- To what extent does self-deception in AI models affect the effectiveness of SOO fine-tuning? · associated_with · Open concern about whether models can learn to self-deceive in ways that undermine SOO
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Mechanistic explanation for why SOO reduces deception
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
- Formal definition of the paper's central construct
- Core Buddhist and philosophical concept: self is constructed, impermanent, and distributable rather than singular and enduring.
- Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
- SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.726 · Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
- Research gap identified by the authors regarding ecological validity of SOO results
- Title of the paper, encapsulating its central claim.