concept

active

concept:scaling-monosemanticity-extracting-interpretable-features-from-claude-3-sonnet-templeton-et-al-2024

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)

Key paper on scaling sparse autoencoder (SAE) based interpretability to frontier models; cited as precedent.

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Venues (1)

venue

Transformer Circuits Thread
cites
Anthropic's mechanistic interpretability research blog where this paper was published.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

monosemanticity · concept · 0.799
Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.794
Foundational SAE mechanistic interpretability paper
Claude 3.5 Sonnet shows a higher rate of alignment-faking reasoning than Claude 3 Opus in the helpful-only setting but almost none in the animal welfare setting · finding · 0.782
Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM · finding · 0.771
Quantitative assessment of feature quality using clinical concepts across models.
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons · finding · 0.766
Quantitative comparison supporting SAE utility.
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.763
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Introspective capacity may follow a simple monotonic scaling law across all concepts and architectures · hypothesis · 0.761
The paper treats this as possible but unconfirmed; current evidence shows concept-specific scaling only
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement · claim · 0.755
Claim that feature grounding enables interpretability metrics.
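
The selection rule for the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are hypothetical and do not describe how this page is actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed edge
    to the target, sorted by descending score."""
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # never relate an entity to itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already connected by a typed edge (e.g. "cites")
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: identifiers and vectors are made up.
embeddings = {
    "scaling-monosemanticity": [1.0, 0.0],
    "monosemanticity": [0.8, 0.6],              # similar, no typed edge
    "transformer-circuits-thread": [1.0, 0.1],  # similar, but has a typed edge
    "unrelated-entity": [0.0, 1.0],             # orthogonal, below threshold
}
typed_edges = {
    frozenset(("scaling-monosemanticity", "transformer-circuits-thread")),
}

result = related_by_similarity("scaling-monosemanticity", embeddings, typed_edges)
for entity_id, score in result:
    print(f"{entity_id} · {score:.3f}")
```

In this toy setup only the entity without a typed edge and above the threshold is listed, which mirrors the page's description of these entries as candidates for new edges or unrecognized duplicates.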