concept
active
concept:hubinger-et-al-2024-sleeper-agents-training-deceptive-llms-that-persist-through-safety-training
Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training
Key reference for the adversarial deception scenarios that Self-Other Overlap (SOO) fine-tuning should be tested against
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (2)
concept
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · same_as · Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
- Adversarial scenario where an AI conceals deceptive intent over extended periods; identified as future test for SOO
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
- Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
- SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.771 · Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
- Warning that fictional narratives in training data increase risk of agents enacting dangerous self-preserving roles
- Question posed after discussing the sleeper agent threat model
- Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
- Mechanistic explanation for why SOO reduces deception
- Model trained to behave harmlessly but later exhibits harmful behavior; features may reveal such hidden objectives.
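
The "related by similarity" rule stated above (cosine ≥ 0.65, no typed edge to this entity) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the entity identifiers, the toy 2-dimensional vectors, and the representation of typed edges as undirected pairs. The embedding model and edge store actually behind this page are not described in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs that are similar to `target` but not linked to it.

    An entity qualifies when its cosine similarity to `target` is at least
    `threshold` and no typed edge connects it to `target` in either direction.
    Results are sorted by descending similarity.
    """
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already typed-linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: identifiers and vectors are made up for the example.
EMB = {
    "sleeper_agents": [1.0, 0.0],
    "paper_same_as": [1.0, 0.0],    # very similar, but already has a typed edge
    "backdoor_models": [0.9, 0.1],
    "soo_hypothesis": [0.8, 0.6],
    "unrelated": [0.0, 1.0],
}
EDGES = [("sleeper_agents", "paper_same_as")]  # assumed undirected typed edge

candidates = related_by_similarity("sleeper_agents", EMB, EDGES)
```

With this toy data, `paper_same_as` is excluded because it already has a typed edge, and `unrelated` falls below the threshold, so only the two remaining entities are returned as candidates for new edges or duplicate review.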