Backdoor Behavior (Sleeper Agents)

Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.

Neighborhood — ranked by edge-count

thinker

Hubinger et al.
studies
Cited for demonstrating models can be trained with persistent backdoor deceptive behaviors (Sleeper Agents)

concept

Strategic Deception
associated_with
Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety trainingconcept0.810
Key reference for adversarial deception scenarios that SOO should be tested against
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024)concept0.808
Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
Sleeper agentconcept0.805
Model trained to behave harmlessly but later exhibits harmful behavior; features may reveal such hidden objectives.
Sleeper Agent Scenarioconcept0.781
Adversarial scenario where an AI conceals deceptive intent over extended periods; identified as future test for SOO
what features activate when Claude is trained to be a sleeper agent?question0.736
Question posed after discussing sleeper agent threat model.
Deceptive Baseline RL Agentconcept0.727
Blue agent trained with reward incentivizing trapping the red agent at the fake landmark
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periodshypothesis0.725
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
How robust is SOO fine-tuning against adversarial settings such as sleeper agent scenarios?question0.723
Open research question about SOO's effectiveness against sophisticated deception