claim

active

claim:shrinkage-from-l1-penalty-significantly-harms-sparse-autoencoder-performance

Shrinkage from L1 penalty significantly harms sparse autoencoder performance.

The L1 penalty systematically underestimates feature activations, which degrades reconstruction and interpretability.
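
The underestimation named in this claim can be illustrated with a toy calculation. The sketch below is not the source paper's setup: it assumes a single feature with a unit-norm decoder direction and a squared-error reconstruction term, so the penalized optimum has a closed form (soft thresholding). The function name `l1_optimal_activation` and the coefficient value are hypothetical, chosen only for illustration.

```python
import numpy as np

def l1_optimal_activation(true_activation, l1_coeff):
    """Minimiser of (x - a)^2 + l1_coeff * |a| over a >= 0.

    Setting the derivative -2 (x - a) + l1_coeff to zero gives
    a = x - l1_coeff / 2, clipped at zero (soft thresholding).
    """
    return np.maximum(true_activation - l1_coeff / 2.0, 0.0)

# Hypothetical "true" feature activations and an illustrative L1 coefficient.
true_acts = np.array([0.0, 0.1, 0.5, 1.0, 2.0])
est = l1_optimal_activation(true_acts, l1_coeff=0.4)

# Every active feature is underestimated by the same constant, l1_coeff / 2;
# activations below that constant are zeroed out entirely.
shrinkage = true_acts - est
print("estimated:", est)
print("shrinkage:", shrinkage)
```

In this toy setting the bias does not vanish as the true activation grows, which is one way to see why the claim describes the underestimation as systematic rather than as noise.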

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Shrinkage (L1 penalty underestimation) · concept · 0.826
Systematic underestimation of non-zero feature activations due to L1 sparsity penalty.
Sparse autoencoders produce interpretable features for large models. · claim · 0.791
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.787
Critique of activation-based interpretability methods.
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.777
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights.
Scaling laws can be used to guide the training of sparse autoencoders. · claim · 0.776
Compute-optimal hyperparameters follow predictable power-law relationships.
SAE training loss (MSE + L1 penalty with decoder norm scaling) · method · 0.773
The objective function combining L2 reconstruction error and L1 penalty scaled by decoder norm, used to train the SAE.
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.767
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms.
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.767
Empirical principle discovered during autoencoder training; led to using 8 billion training points.
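
The "SAE training loss" entry in the list above describes an objective that combines L2 reconstruction error with an L1 penalty scaled by decoder norm. A minimal NumPy sketch of such an objective follows. The ReLU encoder, the parameter names (`w_enc`, `b_enc`, `w_dec`, `b_dec`), the array shapes, and the function name `sae_loss` are assumptions made for illustration and are not taken from this page.

```python
import numpy as np

def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Reconstruction error plus decoder-norm-weighted L1 penalty.

    Assumed shapes: x (batch, d_model), w_enc (d_model, n_features),
    b_enc (n_features,), w_dec (n_features, d_model), b_dec (d_model,).
    """
    f = np.maximum(x @ w_enc + b_enc, 0.0)       # feature activations (ReLU)
    x_hat = f @ w_dec + b_dec                    # reconstruction
    recon = np.sum((x - x_hat) ** 2, axis=-1)    # squared L2 error per example
    dec_norms = np.linalg.norm(w_dec, axis=1)    # one norm per feature
    sparsity = np.sum(f * dec_norms, axis=-1)    # activations weighted by norm
    return np.mean(recon + l1_coeff * sparsity)

# Small random example with hypothetical sizes.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w_enc = rng.normal(size=(3, 5))
b_enc = rng.normal(size=5)
w_dec = rng.normal(size=(5, 3))
b_dec = rng.normal(size=3)
print(sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.1))
```

One consequence of weighting each activation by its decoder norm: in this sketch, shrinking the encoder weights and biases while growing the decoder rows by the same factor leaves the loss unchanged, so the penalty cannot be reduced simply by rescaling.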