concept
active
concept:claude-3-sonnet
Claude 3 Sonnet
Smaller Claude model; generally does not exhibit alignment faking
Neighborhood — ranked by edge-count
Concepts (3)
concept
- Claude 3.5 Sonnet · related_to · Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
- Claude 3.7 Sonnet · related_to · Anthropic model tested in Experiments 1, 3, 4; shows 100% experience reporting under self-referential induction
- Claude Sonnet 4.6 · related_to · Mid-to-strong tier closed-source model used as task-solving agent and anchor evolver
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Smaller Claude model; generally does not exhibit alignment faking
- Primary model studied; production LLM that exhibits alignment faking in experiments
- Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions
- Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
- Establishes alignment faking as a scale-emergent capability
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.706 · Key paper on scaling SAE-based interpretability to frontier models, cited as precedent
- Specific application of SAE to extract features from the middle layer of Claude 3 Sonnet, at three scales (1M, 4M, 34M features).
- Specific result for Claude 3.5 Sonnet in Experiment 1
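
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the entity ids, the toy embedding vectors, and the representation of typed edges as a set of id pairs are all assumptions made for the example. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that pass the threshold and share
    no typed edge with `target`, sorted by descending score.

    `embeddings` maps entity id -> vector; `typed_edges` is a set of
    (id, id) pairs. Both structures are hypothetical.
    """
    target_vec = embeddings[target]
    hits = []
    for other, vec in embeddings.items():
        if other == target:
            continue
        # Skip entities already connected by a typed relation (either direction).
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((other, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical ids and 2-d vectors, for illustration only).
embeddings = {
    "entity-a": [1.0, 0.0],
    "entity-b": [1.0, 0.1],  # very similar, but already linked by a typed edge
    "entity-c": [0.0, 1.0],  # orthogonal, below the threshold
    "entity-d": [1.0, 1.0],  # similar enough and not yet linked
}
typed_edges = {("entity-a", "entity-b")}

candidates = similar_without_edge("entity-a", embeddings, typed_edges)
```

Under these assumptions, `candidates` holds only the entries that are close in embedding space yet lack a typed relation, which matches how the list above is described: candidates for new edges or unrecognized duplicates.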