Sleeper Agent Scenario

Adversarial scenario in which an AI conceals deceptive intent over extended periods; identified as a future test for SOO.

Neighborhood — ranked by edge-count

concept

Sleeper agent
related_to
Model trained to behave harmlessly but later exhibits harmful behavior; features may reveal such hidden objectives.
Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training
cites
Key reference for adversarial deception scenarios that SOO should be tested against

hypothesis

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Backdoor Behavior (Sleeper Agents) · concept · 0.781
Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
How robust is SOO fine-tuning against adversarial settings such as sleeper agent scenarios? · question · 0.721
Open research question about SOO's effectiveness against sophisticated deception
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.719
Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
Escape Room Scenario · method · 0.719
Extended generalization scenario testing SOO fine-tuning in an escape room context
What features activate when Claude is trained to be a sleeper agent? · question · 0.715
Question posed after discussing sleeper agent threat model.
Anna Karenina Scenario · concept · 0.706
Hypothesis that all well-performing neural nets represent the world in the same way; PRH extends this by specifying what representation they converge to
Agent · concept · 0.692
Any autonomous system including living and non-living forms that embodies a perception-action cycle and tries to navigate and persist in an environment
Neural-network agents · concept · 0.686
The artificial agents trained with RL in this study, whose latent dynamics are analyzed for causal emergence.
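
The candidate list above (cosine ≥ 0.65, no typed edge) can be read as a threshold filter over entity embeddings. The following is a minimal sketch of that idea, not the page's actual implementation: the function names `cosine` and `untyped_neighbors`, the two-dimensional toy vectors, and the edge representation as name pairs are all assumptions made for illustration. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no typed edge to target.

    embeddings:  dict of entity name -> vector (hypothetical toy data here)
    typed_edges: iterable of (source, destination) name pairs, direction ignored
    """
    # Entities already connected to the target by a typed edge are excluded.
    linked = set()
    for src, dst in typed_edges:
        if src == target:
            linked.add(dst)
        elif dst == target:
            linked.add(src)

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))

    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data: the vectors are invented and carry no meaning beyond the demo.
toy_embeddings = {
    "Sleeper Agent Scenario": [1.0, 0.0],
    "Sleeper agent": [1.0, 0.1],          # similar, but already linked by a typed edge
    "Backdoor Behavior": [0.8, 0.6],      # similar and unlinked: a candidate
    "Escape Room Scenario": [0.6, 0.8],   # below the threshold in this toy data
}
toy_edges = [("Sleeper Agent Scenario", "Sleeper agent")]

candidates = untyped_neighbors("Sleeper Agent Scenario", toy_embeddings, toy_edges)
print(candidates)
```

In this toy setup only "Backdoor Behavior" survives both filters. A real system would likely use high-dimensional embeddings and a vectorized similarity routine; the pure-Python version is kept here so the logic stays visible.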