concept

active

concept:hubinger-et-al-2024-sleeper-agents-training-deceptive-llms-that-persist-through-safety-training

Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training

Key reference for the adversarial deception scenarios that Self-Other Overlap (SOO) fine-tuning should be tested against

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
cites

Concepts (2)

concept

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024)
same_as
Explicitly trains backdoored models to produce alignment-faking reasoning, in contrast to the naturalistic approach here
Sleeper Agent Scenario
cites
Adversarial scenario where an AI conceals deceptive intent over extended periods; identified as future test for SOO

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Backdoor Behavior (Sleeper Agents) · concept · 0.810
Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025) · claim · 0.775
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.771
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
Because an LLM's training data contain many instances of the rogue AI trope, the danger is that life will imitate art, quite literally · claim · 0.766
Warning that fictional narratives in training data increase risk of agents enacting dangerous self-preserving roles
what features activate when Claude is trained to be a sleeper agent? · question · 0.761
Question posed after discussing the sleeper agent threat model.
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.758
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
By reducing self-other distinctions during safety training, SOO could make it harder for a model to maintain adversarial or deceptive representations · claim · 0.757
Mechanistic explanation for why SOO reduces deception
Sleeper agent · concept · 0.754
Model trained to behave harmlessly that later exhibits harmful behavior; features may reveal such hidden objectives.
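
The "related by similarity" selection described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the entity IDs, and the two-dimensional toy embeddings are all hypothetical, and only the 0.65 threshold comes from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that share
    no typed edge with the target, highest score first."""
    # Collect every entity already connected to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    target_vec = embeddings[target_id]
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue  # skip the target itself and anything with a typed edge
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: IDs and vectors are made up for illustration.
embeddings = {
    "hubinger-2024": [1.0, 0.0],
    "sleeper-agent-scenario": [0.9, 0.1],  # similar, but has a typed edge
    "backdoor-behavior": [0.8, 0.6],       # similar, no typed edge
    "unrelated": [0.0, 1.0],               # below the threshold
}
typed_edges = [("hubinger-2024", "sleeper-agent-scenario")]

result = related_by_similarity("hubinger-2024", embeddings, typed_edges)
print(result)
```

In this toy setup only "backdoor-behavior" is returned: "sleeper-agent-scenario" is excluded because it already has a typed edge, and "unrelated" falls below the threshold. Entities surfaced this way are the candidates for new edges or duplicate merging.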