finding

active

finding:34m-sae-had-roughly-65-dead-features

34M SAE had roughly 65% dead features.

Most features in the largest SAE (34M) were dead, indicating room for improvement.
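A minimal sketch of how a dead-feature fraction like this one could be computed. It assumes a feature counts as "dead" if it never activates above a threshold on any token in an evaluation sample. The function name `dead_feature_fraction`, the `eps` threshold, and the toy data are hypothetical and are not taken from the source paper.

```python
from typing import Iterable, Sequence


def dead_feature_fraction(
    activations: Iterable[Sequence[float]],
    n_features: int,
    eps: float = 0.0,  # assumed activation threshold, not from the paper
) -> float:
    """Return the fraction of features that never exceed eps on any token."""
    ever_active = [False] * n_features
    for token_acts in activations:
        for j, a in enumerate(token_acts):
            if a > eps:
                ever_active[j] = True
    return ever_active.count(False) / n_features


# Toy sample: 4 tokens x 5 features; features 1, 3 and 4 never fire.
toy = [
    [0.0, 0.0, 1.2, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
frac = dead_feature_fraction(toy, n_features=5)
print(frac)
```

The estimate depends on the size of the token sample: a rarely firing feature can look dead on a small sample, so a larger sample can only lower the measured fraction.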

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

For all three SAEs (1M, 4M, 34M), average active features per token <300, and reconstruction variance explained ≥65%. (finding, 0.802)
Basic SAE performance metrics.
A 'San Francisco' feature in 1M SAE splits into 11 fine-grained features in 34M SAE. (finding, 0.775)
Empirical observation of feature splitting.
SAE sparse features (100K+ features, 64 active per token) (concept, 0.775)
The specific SAE architecture trained: 100K+ possible features compressed to 64 active per token for layer-40 activations
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. (finding, 0.765)
SAE features are not simply mirroring individual neurons.
168 of 4,096 A/1 features are dead and 292 are ultralow density, leaving 3,636 for analysis (finding, 0.764)
Characterizes the live vs dead feature distribution in the main autoencoder run
SAE Feature #77278 fires 195,040 times in corpus, associated with satisfaction vs. emptiness dimension (finding, 0.757)
High-frequency SAE feature reported as controlling fundamental positive vs. negative affect dimension
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. (claim, 0.753)
Claim that feature grounding enables interpretability metrics.
SAE Feature #11100 associated with panic, 93rd percentile emotion subspace fraction (finding, 0.732)
Shows high emotion subspace overlap for a specific negative emotion feature