Claude 3 Haiku

Smaller Claude model; generally does not exhibit alignment faking

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude 3 Sonnet · concept · 0.849
Smaller Claude model; generally does not exhibit alignment faking
Claude 3.5 Sonnet · concept · 0.810
Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
Claude 3 Opus · concept · 0.809
Primary model studied; production LLM that exhibits alignment faking in experiments
Claude 3.7 Sonnet · concept · 0.797
Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
Claude 4 Opus · concept · 0.753
Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions
Claude Sonnet 4.6 · concept · 0.750
Mid-to-strong tier closed-source model used as task-solving agent and anchor evolver
Alignment faking emerges in Claude 3 Opus and Claude 3.5 Sonnet but not in Claude 3 Sonnet, Claude 3 Haiku, or Claude 3.5 Haiku · finding · 0.726
Establishes alignment faking as a scale-emergent capability
Claude 3.5 Sonnet shows higher rate of alignment-faking reasoning than Claude 3 Opus in helpful-only setting but almost none in animal welfare · finding · 0.690
Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
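
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `cosine` and `related_by_similarity`, the representation of embeddings as plain lists of floats, and the representation of typed edges as a set of unordered name pairs. It is not the code that produced the scores above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """List entities similar to `target` that have no typed edge to it.

    embeddings:  dict mapping entity name -> vector (hypothetical layout)
    typed_edges: set of frozenset({a, b}) pairs that already share a typed relation
    Returns (name, score) pairs sorted by descending score.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # an entity is not related to itself
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Called with a target entity name, such a function would return the kind of ranked list shown above. Each returned pair is a candidate for a new typed edge or for a duplicate check.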