concept
active
concept:backdoor-behavior-sleeper-agents
Backdoor Behavior (Sleeper Agents)
Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
Neighborhood — ranked by edge-count
Thinkers (1)
thinker
- Hubinger et al. · studies · Cited for demonstrating models can be trained with persistent backdoor deceptive behaviors (Sleeper Agents)
Concepts (1)
concept
- Strategic Deception · associated_with · Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.810 · Key reference for adversarial deception scenarios that SOO should be tested against
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.808 · Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
- Model trained to behave harmlessly but later exhibits harmful behavior; features may reveal such hidden objectives.
- Adversarial scenario where an AI conceals deceptive intent over extended periods; identified as future test for SOO
- Question posed after discussing sleeper agent threat model.
- Blue agent trained with reward incentivizing trapping the red agent at the fake landmark
- SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.725 · Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
- How robust is SOO fine-tuning against adversarial settings such as sleeper agent scenarios? · question · 0.723 · Open research question about SOO's effectiveness against sophisticated deception
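
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration: the page does not say how embeddings are produced or how edges are stored, so the `embeddings` mapping, the `typed_edges` set, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that have no
    typed edge to `target`, ranked by descending similarity.

    embeddings:  hypothetical dict of entity id -> embedding vector
    typed_edges: hypothetical set of (source, destination) id pairs
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked by a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data, invented purely for illustration.
embeddings = {
    "backdoor-behavior": [1.0, 0.0],
    "sleeper-agents-paper": [1.0, 0.1],
    "strategic-deception": [1.0, 1.0],
    "unrelated-entity": [0.0, 1.0],
}
typed_edges = {("backdoor-behavior", "strategic-deception")}
candidates = similar_without_edge("backdoor-behavior", embeddings, typed_edges)
```

With this toy data, `strategic-deception` is excluded because it already has a typed edge, and `unrelated-entity` falls below the threshold, so only `sleeper-agents-paper` remains as a candidate for a new edge or a duplicate check.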