concept
active
concept:scaling-monosemanticity-extracting-interpretable-features-from-claude-3-sonnet-templeton-et-al-2024
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)
Key paper on scaling sparse autoencoder (SAE) based interpretability to frontier models, cited as precedent
Neighborhood — ranked by edge-count
Papers (1)
paper
Venues (1)
venue
- Anthropic's mechanistic interpretability research blog where this paper was published.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.794 · Foundational SAE mechanistic interpretability paper
- Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
- Quantitative assessment of feature quality using clinical concepts across models.
- Quantitative comparison supporting SAE utility.
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Introspective capacity may follow a simple monotonic scaling law across all concepts and architectures · hypothesis · 0.761 · The paper treats this as possible but unconfirmed; current evidence shows concept-specific scaling only
- Claim that feature grounding enables interpretability metrics.