Our SAEs' features are more interpretable than neurons.

Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Findings (1)

finding

Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons.
supports
Quantitative comparison supporting SAE utility.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
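The selection rule described above (cosine similarity at or above 0.65, and no typed edge to this entity) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary of embeddings, the representation of typed edges as unordered pairs, and the toy vectors are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge
    to the target, sorted by descending similarity.

    `embeddings` maps entity id -> vector; `typed_edges` is a set of frozensets,
    each holding the two ids joined by a typed relation (hypothetical layout).
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data (made up for illustration only).
embeddings = {
    "target": [1.0, 0.0],
    "near_untyped": [0.9, 0.1],
    "near_typed": [1.0, 0.05],
    "far": [0.0, 1.0],
}
typed_edges = {frozenset(("target", "near_typed"))}
neighbors = related_by_similarity("target", embeddings, typed_edges)
```

In this toy run only "near_untyped" is returned: "near_typed" is similar but already has a typed edge, and "far" falls below the threshold.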

SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure.
claim · 0.819
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.

SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation.
claim · 0.809
Extension of mechanistic interpretability findings to the metacognitive domain.

SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement.
claim · 0.805
Claim that feature grounding enables interpretability metrics.

82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance.
finding · 0.803
SAE features are not simply mirroring individual neurons.

SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features.
claim · 0.797
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence.

Neurons can correspond to interpretable functional roles but interpretations in terms of individual neurons are unlikely to be the most parsimonious.
claim · 0.794
Claim from footnote 3, acknowledging neuron-level interpretability while arguing subcomponents are better.

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence.
claim · 0.791
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights.

Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions.
claim · 0.789
Surprising finding that the two evaluation methods diverge in their relationship with persistence.
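
The neuron-correlation finding listed above (features whose maximum Pearson correlation with any MLP neuron is ≤0.3) describes a check that can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: the function names, the list-of-lists activation layout, and the toy activations are assumptions, and the sketch uses the signed maximum because the finding does not say whether absolute values were taken.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def fraction_weakly_correlated(feature_acts, neuron_acts, cutoff=0.3):
    """Fraction of features whose maximum (signed) Pearson correlation with any
    neuron is at or below `cutoff`. Each row holds one unit's activations over
    the same sequence of inputs (hypothetical layout)."""
    weak = 0
    for feature in feature_acts:
        max_corr = max(pearson(feature, neuron) for neuron in neuron_acts)
        if max_corr <= cutoff:
            weak += 1
    return weak / len(feature_acts)


# Toy activations (made up): the first feature copies neuron 0, the second is
# uncorrelated with both neurons.
neuron_acts = [[1, 2, 3, 4], [4, 3, 2, 1]]
feature_acts = [[1, 2, 3, 4], [1, -1, -1, 1]]
fraction = fraction_weakly_correlated(feature_acts, neuron_acts)
```

On the toy data, one of the two features has no neuron correlated above the cutoff, so the fraction is one half.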