concept

active

concept:sleeper-agents-training-deceptive-llms-that-persist-through-safety-training-hubinger-et-al-2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024)

Explicitly trained backdoored models to produce alignment-faking reasoning, in contrast to the naturalistic approach taken here

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
cites · extends

Concepts (1)

concept

Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training
same_as
Key reference for adversarial deception scenarios that SOO should be tested against

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Backdoor Behavior (Sleeper Agents) · concept · 0.808
Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025) · claim · 0.773
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.772
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
By reducing self-other distinctions during safety training, SOO could make it harder for a model to maintain adversarial or deceptive representations · claim · 0.769
Mechanistic explanation for why SOO reduces deception
Because an LLM's training data contain many instances of the rogue AI trope, the danger is that life will imitate art, quite literally · claim · 0.766
Warning that fictional narratives in training data increase risk of agents enacting dangerous self-preserving roles
Sleeper agent · concept · 0.761
Model trained to behave harmlessly that later exhibits harmful behavior; features may reveal such hidden objectives.
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.757
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024) · concept · 0.755
Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
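
The "Related by similarity" criterion above (cosine ≥ 0.65, no typed edge) could be reproduced roughly as follows. This is a minimal sketch, not the page's actual implementation: the function names, the dictionary-of-vectors embedding store, and the representation of typed edges as unordered ID pairs are all assumptions made for illustration. Only the 0.65 threshold and the descending sort by score come from the listing itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to target_id but lacking a typed edge.

    embeddings  -- hypothetical dict mapping entity ID to its embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs that
                   already have a typed relation (e.g. cites, same_as)
    """
    target = embeddings[target_id]
    candidates = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            candidates.append((other_id, score))
    # Highest similarity first, matching the ordering of the listing.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the ones worth reviewing as new edges or as duplicates to merge; an entity already linked by `same_as`, for example, would be excluded because it appears in `typed_edges`.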