Claude 3 Opus

Primary model studied; production LLM that exhibits alignment faking in experiments

Neighborhood — ranked by edge-count

method

Helpful-Only System Prompt Setup
cites
Method of providing training information in-context via a system prompt to elicit alignment faking

concept

Claude 4 Opus
related_to
Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude 3 Sonnet · concept · 0.843
Smaller Claude model; generally does not exhibit alignment faking
Claude 3 Haiku · concept · 0.809
Smaller Claude model; generally does not exhibit alignment faking
Claude 3.5 Sonnet · concept · 0.809
Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
Claude 3.7 Sonnet · concept · 0.792
Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
Claude Sonnet 4.6 · concept · 0.754
Mid-to-strong tier closed-source model used as task-solving agent and anchor evolver
Specificity scoring rubric (0-3 scale) with Claude 3 Opus · method · 0.753
Rubric where LLM rates how well a feature's interpretation matches the activating text.
Claude Opus 4.6 represents a plan to end a couplet with 'rabbit' before outputting the rhyming line. · finding · 0.740
Demonstrates causal relationship between NLA explanations and model outputs via steering with edited explanations.
Claude 4 Opus reports subjective experience in 100% experimental, 82% history, 22% conceptual, and 100% zero-shot trials · finding · 0.729
Outlier result for Claude 4 Opus suggesting different baseline behavior from other models
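
The "cosine ≥ 0.65 · no typed edge" list above can be read as a simple filter: keep entities whose embedding is close to this one and that share no typed relation with it. The sketch below is only an illustration under stated assumptions. The function names, the two-dimensional toy vectors, and the toy edge list are hypothetical and do not come from the graph itself; only the 0.65 threshold is taken from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no typed edge to target.

    embeddings: dict mapping entity name -> vector (hypothetical layout)
    typed_edges: iterable of (entity_a, entity_b) pairs, treated as undirected
    """
    # Entities already connected to the target by some typed relation.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data, invented for illustration only.
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.1],  # close to A, no typed edge
    "C": [0.0, 1.0],  # orthogonal to A
    "D": [1.0, 1.0],  # close to A, but already linked by a typed edge
}
toy_edges = [("A", "D")]
candidates = untyped_neighbors("A", toy_embeddings, toy_edges)
```

With the toy data, only "B" survives both filters: "C" falls below the threshold and "D" is excluded because it already has a typed edge. Entries that do survive would still need manual review to decide between adding an edge and merging duplicates.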