finding
active
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance.
SAE features are not simply mirroring individual neurons.
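The statistic behind this finding can be illustrated with a minimal sketch. It assumes feature and neuron activations have been collected over the same tokens as dense arrays. The function names, array shapes, and the use of signed (rather than absolute) correlation are assumptions for illustration, not details taken from the source paper.

```python
import numpy as np


def max_pearson_per_feature(features: np.ndarray, neurons: np.ndarray) -> np.ndarray:
    """For each feature column, return its highest Pearson correlation with any neuron column.

    features: (n_tokens, n_features) activations (hypothetical layout)
    neurons:  (n_tokens, n_neurons) activations over the same tokens
    """
    # Center each column so that dot products become covariances.
    f = features - features.mean(axis=0)
    n = neurons - neurons.mean(axis=0)

    # Column norms; constant columns get an infinite norm so their correlation is 0.
    f_norm = np.linalg.norm(f, axis=0)
    n_norm = np.linalg.norm(n, axis=0)
    f_norm[f_norm == 0] = np.inf
    n_norm[n_norm == 0] = np.inf

    # (n_features, n_neurons) matrix of Pearson correlations.
    corr = (f / f_norm).T @ (n / n_norm)
    return corr.max(axis=1)


def fraction_at_or_below(max_corr: np.ndarray, threshold: float = 0.3) -> float:
    """Share of features whose best-matching neuron correlates at or below the threshold."""
    return float(np.mean(max_corr <= threshold))
```

For a dictionary with many features, the full correlation matrix may not fit in memory, so in practice the features would likely be processed in chunks. The sketch omits this for brevity.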
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Quantitative comparison supporting SAE utility.
- Systematic comparison showing features are substantially more universal than neurons across models
- Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
- Claim that feature grounding enables interpretability metrics.
- Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
- Quantitative assessment of feature quality using clinical concepts across models.
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.782 · Shows low agreement between the two evaluation modalities
- Validates robustness of alignment metric choice