what features activate when Claude is trained to be a sleeper agent?

Question posed after discussing the sleeper agent threat model.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

what features activate on tokens we'd expect to signify Claude's self-identity? · question · 0.806
Open question from the discussion on future research directions.
what features need to activate / remain inactive for Claude to give advice on producing Chemical, Biological, Radiological or Nuclear (CBRN) weapons? · question · 0.789
Potential safety claim about suppressing features to prevent CBRN advice.
what features activate when we ask questions probing Claude's goals and values? · question · 0.783
Direction for understanding model's internal objectives via features.
Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.761
Key reference for adversarial deception scenarios that SOO should be tested against.
what features activate when we ask Claude questions about its subjective experience? · question · 0.754
Question about features related to consciousness and self-report.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.753
Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here.
Claude achieves significantly higher Spearman correlation predicting feature activations vs neuron activations · finding · 0.745
Automated interpretability analysis of activations confirms features are more interpretable than neurons.
Mechanism by which activation of an emotion feature sometimes leads to later suppression of that same feature · question · 0.737
Identified research gap: the paper observes anti-persistence but has no explanation for it.
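
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs. The function names, the toy vectors, and the entity keys are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data (hypothetical): 2-d vectors stand in for real embeddings.
embeddings = {
    "q_sleeper": [1.0, 0.0],
    "q_identity": [0.8, 0.6],
    "source_paper": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
}
# The source paper already has a typed edge (e.g. extracted_from), so it is excluded.
typed_edges = {frozenset(("q_sleeper", "source_paper"))}

related = related_by_similarity("q_sleeper", embeddings, typed_edges)
```

In this toy setup only `q_identity` survives: the source paper is similar but already linked by a typed edge, and the unrelated entity falls below the threshold.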