finding

active

finding:82-of-features-in-1m-sae-had-maximum-pearson-correlation-0-3-with-any-mlp-neuron-and-manual-inspection-showed-no-semantic-resemblance

82% of features in the 1M SAE had a maximum Pearson correlation of ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance.

SAE features are not simply mirroring individual neurons.
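A statistic of this kind can be computed roughly as follows. This is a minimal sketch, not the paper's code: the function names, the array layout (rows are tokens, columns are units), and the synthetic data are all assumptions made for illustration, and the record does not specify which tokens or which layer's neurons the original measurement used.

```python
import numpy as np


def max_pearson_per_feature(features, neurons):
    """For each feature column, return its largest Pearson correlation
    with any neuron column.

    features: array of shape (n_tokens, n_features)  (hypothetical layout)
    neurons:  array of shape (n_tokens, n_neurons)   (hypothetical layout)
    """
    # Center each column so that dot products become covariances.
    f = features - features.mean(axis=0)
    n = neurons - neurons.mean(axis=0)

    # Column norms; constant columns get an infinite norm so that their
    # correlation evaluates to 0 instead of dividing by zero.
    f_norm = np.linalg.norm(f, axis=0)
    n_norm = np.linalg.norm(n, axis=0)
    f_norm = np.where(f_norm == 0, np.inf, f_norm)
    n_norm = np.where(n_norm == 0, np.inf, n_norm)

    # Full (n_features, n_neurons) Pearson correlation matrix.
    corr = (f.T @ n) / (f_norm[:, None] * n_norm[None, :])
    return corr.max(axis=1)


def fraction_at_or_below(max_corr, threshold=0.3):
    """Fraction of features whose best-matching neuron has a correlation
    at or below the threshold."""
    return float(np.mean(max_corr <= threshold))


if __name__ == "__main__":
    # Synthetic activations, purely illustrative.
    rng = np.random.default_rng(0)
    neurons = rng.normal(size=(1000, 5))
    features = np.column_stack([
        2.0 * neurons[:, 2] + 1.0,   # mirrors one neuron exactly
        rng.normal(size=1000),       # unrelated to every neuron
    ])
    max_corr = max_pearson_per_feature(features, neurons)
    print(max_corr, fraction_at_or_below(max_corr))
```

For a dictionary with a million features the full correlation matrix would not fit in memory, so in practice the same computation would need to be done in batches of features while keeping a running maximum.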

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons. · finding · 0.840
Quantitative comparison supporting SAE utility.
Features in A/1 have a median activation correlation of 0.72 with the most similar feature in B/1; neurons have a median of 0.46 · finding · 0.807
Systematic comparison showing features are substantially more universal than neurons across models
Our SAEs' features are more interpretable than neurons. · claim · 0.803
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.786
Claim that feature grounding enables interpretability metrics.
The most correlated neuron, A/neurons/470, has a correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, and URLs · finding · 0.785
Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM. · finding · 0.783
Quantitative assessment of feature quality using clinical concepts across models.
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.782
Shows low agreement between the two evaluation modalities
Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10^-105 · finding · 0.777
Validates robustness of alignment metric choice