finding

active

finding:automated-interpretability-claude-3-opus-and-specificity-scoring-show-sae-features-are-significantly-more-interpretable-and-specific-than-mlp-neurons

Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons.

Quantitative comparison supporting SAE utility.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

Our SAEs' features are more interpretable than neurons.
supports
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. (finding, 0.840)
SAE features are not simply mirroring individual neurons.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. (claim, 0.798)
Claim that feature grounding enables interpretability metrics.
Automated interpretability using LLMs can usefully score feature specificity. (claim, 0.797)
Claude 3 Opus ratings aligned with human judgment of feature descriptions.
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence. (claim, 0.792)
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights.
SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions. (finding, 0.782)
Out-of-distribution generalization of SAE features.
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. (claim, 0.780)
Surprising finding that the two evaluation methods diverge in their relationship with persistence.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation. (claim, 0.776)
Extension of mechanistic interpretability findings to the metacognitive domain.
SAE-based mechanistic interpretability will be superseded by manifold-based analysis for understanding semantic concepts within 24 months. (prediction, 0.776)
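
The neighborhood rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the system's actual implementation: the function names, the entity IDs, and the toy 2-D embeddings are all hypothetical, and the sketch assumes that each entity has an embedding vector and that typed edges are stored as unordered ID pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Hypothetical toy data: IDs and vectors are made up for illustration.
embeddings = {
    "finding:a": [1.0, 0.0],
    "claim:b": [0.8, 0.6],   # similar to finding:a, no typed edge
    "claim:c": [0.0, 1.0],   # orthogonal to finding:a, below threshold
    "claim:d": [1.0, 0.1],   # very similar, but already linked
}
typed_edges = {frozenset(("finding:a", "claim:d"))}

candidates = similar_without_edge("finding:a", embeddings, typed_edges)
for entity_id, score in candidates:
    print(f"{entity_id}\t{score:.3f}")
```

In this toy setup only `claim:b` survives: `claim:c` falls below the threshold and `claim:d` is excluded because a typed edge already exists. Entries that survive such a filter are the ones worth reviewing as candidate edges or possible duplicates.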