claim
active
claim:shrinkage-from-l1-penalty-significantly-harms-sparse-autoencoder-performance
Shrinkage from L1 penalty significantly harms sparse autoencoder performance.
Systematic underestimation of feature activations degrades reconstruction and interpretability.
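The shrinkage effect can be illustrated with a one-dimensional toy problem. This is a minimal sketch, not the paper's setup: it assumes a single scalar activation `a` fitted to a target `x` by minimizing `0.5*(x - a)**2 + lam*abs(a)`, whose well-known closed-form minimizer is soft-thresholding. The function name `soft_threshold` and the sample values are hypothetical.

```python
def soft_threshold(x: float, lam: float) -> float:
    """Closed-form minimizer of 0.5*(x - a)**2 + lam*abs(a) over a."""
    if x > lam:
        return x - lam   # non-zero output is pulled toward zero by lam
    if x < -lam:
        return x + lam
    return 0.0           # small inputs are set exactly to zero (sparsity)


# Illustrative values only (assumption, not data from the source).
true_activations = [0.0, 0.2, 1.0, 2.0, 4.0]
lam = 0.3

for x in true_activations:
    a = soft_threshold(x, lam)
    print(f"true={x:.2f}  estimate={a:.2f}  shortfall={x - a:.2f}")
```

In this toy setting every surviving activation falls short of its true value by `lam`, which is the systematic underestimation the claim refers to. A real sparse autoencoder is more complicated, so the size of the effect there need not match this sketch.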
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Systematic underestimation of non-zero feature activations due to L1 sparsity penalty.
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Critique of activation-based interpretability methods.
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights.
- Compute-optimal hyperparameters follow predictable power-law relationships.
- The objective function combining L2 reconstruction error and L1 penalty scaled by decoder norm, used to train the SAE.
- Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
- Empirical principle discovered during autoencoder training; led to using 8 billion training points.
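
One neighbor in the list above describes an objective combining L2 reconstruction error with an L1 penalty scaled by decoder norm. A rough sketch of one way such a loss could be written is given here. The exact form is an assumption: the squared-error reconstruction term, the per-feature weighting by the L2 norm of each decoder column, and the names `sae_loss`, `decoder_columns`, `b_dec`, and `lam` are all hypothetical and not taken from the source.

```python
import math


def sae_loss(x, f, decoder_columns, b_dec, lam):
    """Assumed form: ||x - x_hat||^2 + lam * sum_i |f_i| * ||d_i||_2.

    x               -- input vector (list of floats)
    f               -- feature activations, one per decoder column
    decoder_columns -- list of decoder vectors d_i, each the length of x
    b_dec           -- decoder bias, same length as x
    lam             -- sparsity coefficient
    """
    # Reconstruction: x_hat = b_dec + sum_i f_i * d_i
    x_hat = list(b_dec)
    for f_i, d_i in zip(f, decoder_columns):
        for j, w in enumerate(d_i):
            x_hat[j] += f_i * w

    # Squared L2 reconstruction error.
    recon = sum((x_j - xh_j) ** 2 for x_j, xh_j in zip(x, x_hat))

    # L1 penalty on activations, each weighted by its decoder column norm.
    sparsity = sum(
        abs(f_i) * math.sqrt(sum(w * w for w in d_i))
        for f_i, d_i in zip(f, decoder_columns)
    )
    return recon + lam * sparsity


# Tiny illustrative call with made-up numbers.
print(sae_loss(x=[1.0, 0.0], f=[1.0], decoder_columns=[[2.0, 0.0]],
               b_dec=[0.0, 0.0], lam=0.5))
```

Weighting each activation by its decoder norm means the penalty cannot be lowered just by shrinking activations and growing decoder weights to compensate. The penalty still grows with activation magnitude, so the shrinkage pressure described by this claim remains.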