Claude 4 Opus

Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions

Neighborhood — ranked by edge-count

paper

concept

Claude 3 Opus
related_to
Primary model studied; production LLM that exhibits alignment faking in experiments

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude Sonnet 4.6concept0.803
Mid-to-strong tier closed-source model used as task-solving agent and anchor evolver
Claude Opus 4.6 represents a plan to end a couplet with 'rabbit' before outputting the rhyming line.finding0.793
Demonstrates causal relationship between NLA explanations and model outputs via steering with edited explanations.
Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested modelsclaim0.785
Based on consistent best performance across experiments.
Claude 4 Opus reports subjective experience in 100% experimental, 82% history, 22% conceptual, and 100% zero-shot trialsfinding0.780
Outlier result for Claude 4 Opus suggesting different baseline behavior from other models
Claude 3 Sonnetconcept0.775
Smaller Claude model; generally does not exhibit alignment faking
Claude 3.5 Sonnetconcept0.771
Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
Claude 3.7 Sonnetconcept0.763
Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
Claude 3 Haikuconcept0.753
Smaller Claude model; generally does not exhibit alignment faking