concept
active
concept:claude-sonnet-4-6
Claude Sonnet 4.6
Mid-to-strong tier closed-source model used as task-solving agent and anchor evolver
Neighborhood — ranked by edge-count
Concepts (3)
- Claude 3.5 Sonnet · related_to · Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
- Claude 3.7 Sonnet · related_to · Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
- Claude 3 Sonnet · related_to · Smaller Claude model; generally does not exhibit alignment faking
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions
- Primary model studied; production LLM that exhibits alignment faking in experiments
- Smaller Claude model; generally does not exhibit alignment faking
- Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
- Claude Opus 4.6 represents a plan to end a couplet with 'rabbit' before outputting the rhyming line. · finding · 0.744 · Demonstrates causal relationship between NLA explanations and model outputs via steering with edited explanations.
- Full evolver-side SWE results showing comparable performance across Claude family tiers
- Specific result for Claude 3.5 Sonnet in Experiment 1
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.718 · Key paper on scaling SAE-based interpretability to frontier models, cited as precedent