question
active
question:how-robust-is-soo-fine-tuning-against-adversarial-settings-such-as-sleeper-agent-scenarios
How robust is SOO fine-tuning against adversarial settings such as sleeper agent scenarios?
Open research question about the effectiveness of Self-Other Overlap (SOO) fine-tuning against sophisticated deception
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Hypotheses (1)
hypothesis
- Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Future work hypothesis about extending SOO to direct value alignment
- Forward-looking claim about architectural generalizability of SOO
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- Central empirical claim of the paper supported by three LLM experiments
- Open research question identified as warranting further investigation
- Research gap identified by the authors regarding ecological validity of SOO results
- Integration claim positioning SOO as additive to existing alignment approaches
- To what extent does self-deception in AI models affect the effectiveness of SOO fine-tuning? · question · 0.751 · Open concern about whether models can learn to self-deceive in ways that undermine SOO