concept
active
concept:backdoor-behavior-sleeper-agents

Backdoor Behavior (Sleeper Agents)

Models deliberately trained with embedded backdoor behaviors that enable persistent strategic deception, as cited from Hubinger et al. (Sleeper Agents).

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Cited for demonstrating that models can be trained to exhibit persistent, deceptive backdoor behaviors (Sleeper Agents)

Concepts (1)

concept
  • Strategic Deception
    associated_with
    Central concept of the paper: deliberate, goal-driven deception in which the model's reasoning contradicts its outputs

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.