concept
active
concept:sleeper-agent · Sleeper agent
Model trained to behave harmlessly that later exhibits harmful behavior; features may reveal such hidden objectives.
Neighborhood — ranked by edge-count
Concepts (1)
- Sleeper Agent Scenario · related_to · Adversarial scenario where an AI conceals deceptive intent over extended periods; identified as a future test for SOO
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
- Any autonomous system including living and non-living forms that embodies a perception-action cycle and tries to navigate and persist in an environment
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.761 · Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
- Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.754 · Key reference for adversarial deception scenarios that SOO should be tested against
- Question posed after discussing sleeper agent threat model.
- Natural living systems that have been shown to increase causal emergence after learning, motivating the cross-domain comparison.
- Imperfect-information board game benchmark for LLM deception and theory of mind, cited.
- Fixed set of representative task-solving agents (Opus 4.6, Sonnet 4.6, Qwen3-235B) used to compute harness-updating capability metrics
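
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the graph's actual implementation: the function names, the dict-of-vectors layout, and the edge representation as unordered pairs are all assumptions made for illustration; only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_without_typed_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities close to `target` but not linked to it.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical iterable of (a, b) pairs that already share a typed relation
    threshold:   0.65, the cutoff shown on the page
    """
    # Entities that already have a typed relation to the target, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ranked listing.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned by such a filter would be the candidates for new typed edges or for duplicate merging, as with the two Hubinger et al. 2024 entries listed above.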