concept
active
concept:sleeper-agent-scenario
Sleeper Agent Scenario
Adversarial scenario in which an AI conceals deceptive intent over extended periods; identified as a future test for SOO
Neighborhood — ranked by edge-count
Concepts (2)
- Sleeper agent · related_to · Model trained to behave harmlessly but later exhibits harmful behavior; features may reveal such hidden objectives.
- Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · cites · Key reference for adversarial deception scenarios that SOO should be tested against
Hypotheses (1)
- Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
- How robust is SOO fine-tuning against adversarial settings such as sleeper agent scenarios? · question · 0.721 · Open research question about SOO's effectiveness against sophisticated deception
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.719 · Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
- Extended generalization scenario testing SOO fine-tuning in an escape room context
- Question posed after discussing sleeper agent threat model.
- Hypothesis that all well-performing neural nets represent the world in the same way; PRH extends this by specifying what representation they converge to
- Any autonomous system including living and non-living forms that embodies a perception-action cycle and tries to navigate and persist in an environment
- The artificial agents trained with RL in this study, whose latent dynamics are analyzed for causal emergence.
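
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched as follows. This is only an illustration under stated assumptions: the entity ids, the two-dimensional toy embeddings, and the function names `cosine` and `similar_without_edge` are hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with score >= threshold and no typed edge to target."""
    # Entities already linked to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    results = []
    for entity, vec in embeddings.items():
        if entity == target or entity in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((entity, round(score, 3)))
    # Highest similarity first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data; real embeddings would have many more dimensions.
embeddings = {
    "sleeper-agent-scenario": [1.0, 0.0],
    "sleeper-agent": [1.0, 0.1],            # similar, but already has a typed edge
    "soo-robustness-question": [0.8, 0.6],  # cosine 0.8, above the threshold
    "escape-room-scenario": [0.6, 0.8],     # cosine 0.6, below the threshold
}
typed_edges = {("sleeper-agent-scenario", "sleeper-agent")}

candidates = similar_without_edge("sleeper-agent-scenario", embeddings, typed_edges)
print(candidates)
```

Entities that pass this filter are review candidates only; whether one deserves a typed edge or is a duplicate still needs a human or downstream check.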