concept
active
concept:claude-3-opus · Claude 3 Opus
Primary model studied; production LLM that exhibits alignment faking in experiments
Neighborhood — ranked by edge-count
Methods (1)
method
- Method of providing training information in-context via a system prompt to elicit alignment faking
Concepts (1)
concept
- Claude 4 Opus · related_to · Anthropic model; outlier in Experiment 1 with high baseline affirmation, including under zero-shot and history conditions
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Smaller Claude model; generally does not exhibit alignment faking
- Smaller Claude model; generally does not exhibit alignment faking
- Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
- Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
- Mid-to-strong tier closed-source model used as task-solving agent and anchor evolver
- Rubric where LLM rates how well a feature's interpretation matches the activating text.
- Claude Opus 4.6 represents a plan to end a couplet with 'rabbit' before outputting the rhyming line. · finding · 0.740 · Demonstrates a causal relationship between NLA explanations and model outputs via steering with edited explanations.
- Outlier result for Claude 4 Opus suggesting different baseline behavior from other models