Self-Deception Risk in SOO

Concern that models engaging in self-deception could reduce the effectiveness of SOO fine-tuning

Neighborhood — ranked by edge-count

Questions (1)

question

To what extent does self-deception in AI models affect the effectiveness of SOO fine-tuning?
associated_with
Open concern about whether models can learn to self-deceive in ways that undermine SOO

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

By reducing self-other distinctions during safety training, SOO could make it harder for a model to maintain adversarial or deceptive representations · claim · 0.779
Mechanistic explanation for why SOO reduces deception
Self-Other Overlap (SOO) Fine-Tuning · framework · 0.765
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
We define Self-Other Overlap (SOO) as the extent to which a model exhibits similar internal representations when reasoning about itself and others in similar contexts. · quote · 0.753
Formal definition of the paper's central construct
Self As Illusion · concept · 0.730
Core Buddhist and philosophical concept: self is constructed, impermanent, and distributable rather than singular and enduring.
Deception- and Roleplay-Related SAE Features · concept · 0.728
Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.726
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
SOO has only been tested in simplified text scenarios and controlled RL environments, not complex real-world tasks · concept · 0.722
Research gap identified by the authors regarding ecological validity of SOO results
There is no self-evidence · quote · 0.721
Title of the paper, encapsulating its central claim.
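
The "Related by similarity" filter described above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the tool's actual implementation: the function names, the dictionary layout for embeddings and typed edges, and the toy vectors are all hypothetical, chosen only to show how a similarity threshold combined with an edge check yields candidates for new edges or duplicates.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are close to `target`
    in embedding space but have no typed edge to it, highest score first.

    embeddings:  dict mapping entity name -> embedding vector (hypothetical layout)
    typed_edges: dict mapping entity name -> set of names it has a typed edge to
    """
    linked = typed_edges.get(target, set())
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already connected
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data, purely illustrative.
toy_embeddings = {
    "self_deception_risk": [1.0, 0.0],
    "soo_fine_tuning": [1.0, 0.1],     # near-parallel: high similarity
    "unrelated_entity": [0.0, 1.0],    # orthogonal: similarity 0
    "linked_question": [1.0, 1.0],     # similar, but already has a typed edge
}
toy_edges = {"self_deception_risk": {"linked_question"}}

candidates = similar_without_edge("self_deception_risk", toy_embeddings, toy_edges)
```

In this toy setup only `soo_fine_tuning` is returned: `unrelated_entity` falls below the threshold, and `linked_question` is excluded because it already has a typed edge, mirroring how the question node is listed separately from the similarity neighborhood.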