question
active
question:what-features-activate-when-claude-is-trained-to-be-a-sleeper-agent
What features activate when Claude is trained to be a sleeper agent?
Question posed after discussing the sleeper agent threat model.
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Open question from the discussion on future research directions.
- Potential safety claim about suppressing features to prevent CBRN advice.
- Direction for understanding model's internal objectives via features.
- Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.761 · Key reference for adversarial deception scenarios that SOO should be tested against
- Question about features related to consciousness and self-report.
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.753 · Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
- Automated interpretability analysis of activations confirms features are more interpretable than neurons.
- Mechanism by which activation of an emotion feature sometimes leads to later suppression of that same feature · question · 0.737 · Identified research gap: the paper observes anti-persistence but has no explanation for it