concept
active
concept:sleeper-agents-training-deceptive-llms-that-persist-through-safety-training-hubinger-et-al-2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024)

Hubinger et al. explicitly trained backdoored models to produce alignment-faking reasoning, in contrast to the naturalistic approach taken here.

Neighborhood — ranked by edge-count

Concepts (1)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.