concept
active
concept:sleeper-agents-training-deceptive-llms-that-persist-through-safety-training-hubinger-et-al-2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024)
Explicitly trained backdoored models to produce alignment-faking reasoning, in contrast to the naturalistic approach here
Neighborhood — ranked by edge-count
Papers (1)
paper
- Alignment faking in large language models · cites · extends
Concepts (1)
concept
- Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · same_as · Key reference for adversarial deception scenarios that SOO should be tested against
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
- Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
- SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.772 · Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
- Mechanistic explanation for why SOO reduces deception
- Warning that fictional narratives in training data increase risk of agents enacting dangerous self-preserving roles
- Model trained to behave harmlessly but later exhibits harmful behavior; features may reveal such hidden objectives.
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
- Claude 3 Opus lying to auditors; prior case study of deceptive tendencies