Sleeper agent

Model that is trained to behave harmlessly but later exhibits harmful behavior; features may reveal such hidden objectives.

Neighborhood — ranked by edge-count

Sleeper Agent Scenario · concept · related_to
Adversarial scenario where an AI conceals deceptive intent over extended periods; identified as future test for SOO

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
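
As a rough illustration of the filter described above, here is a minimal sketch that assumes each entity has an embedding vector and that typed edges are stored as name pairs. The function names, the toy vectors, and the data layout are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from this page.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have
    no typed edge to `target`, highest score first."""
    # Entities already linked to the target by a typed edge are excluded.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy 2-D vectors, made up for illustration; real embeddings would be much larger.
embeddings = {
    "Sleeper agent": [1.0, 0.0],
    "Sleeper Agent Scenario": [0.9, 0.1],
    "Backdoor Behavior (Sleeper Agents)": [0.8, 0.6],
    "Unrelated": [0.0, 1.0],
}
typed_edges = [("Sleeper agent", "Sleeper Agent Scenario")]

candidates = untyped_neighbors("Sleeper agent", embeddings, typed_edges)
print(candidates)
```

In this toy data, "Sleeper Agent Scenario" is skipped because it already has a typed edge, and "Unrelated" falls below the threshold, so only the backdoor entry is left as a candidate for a new edge or a duplicate check.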

Backdoor Behavior (Sleeper Agents) · concept · 0.805
Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
Agent · concept · 0.768
Any autonomous system including living and non-living forms that embodies a perception-action cycle and tries to navigate and persist in an environment
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.761
Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.754
Key reference for adversarial deception scenarios that SOO should be tested against
what features activate when Claude is trained to be a sleeper agent? · question · 0.732
Question posed after discussing sleeper agent threat model.
Biological agents · concept · 0.721
Natural living systems that have been shown to increase causal emergence after learning, motivating the cross-domain comparison.
Suspicion-Agent · framework · 0.716
Imperfect-information board game benchmark for LLM deception and theory of mind, cited.
Anchor Agent Set · concept · 0.702
Fixed set of representative task-solving agents (Opus 4.6, Sonnet 4.6, Qwen3-235B) used to compute harness-updating capability metrics