method
active
method:sae-training-loss-mse-l1-penalty-with-decoder-norm-scaling
SAE training loss (MSE + L1 penalty with decoder norm scaling)
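The objective function used to train the sparse autoencoder (SAE): an L2 reconstruction error (MSE) plus an L1 sparsity penalty on the feature activations, with each activation weighted by the norm of its decoder vector.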
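A minimal NumPy sketch of such an objective is given here for orientation. It is not taken from the source: the ReLU encoder, the bias terms, the reductions (sum over dimensions, mean over the batch), and all names (`sae_loss`, `W_enc`, `b_enc`, `W_dec`, `b_dec`, `l1_coeff`) are assumptions made for illustration.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Illustrative MSE + decoder-norm-scaled L1 loss (assumed form).

    Assumed shapes: x (batch, d_model), W_enc (d_model, n_features),
    W_dec (n_features, d_model).
    """
    # Encoder with an assumed ReLU nonlinearity: (batch, n_features).
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    # Decoder reconstruction: (batch, d_model).
    x_hat = f @ W_dec + b_dec
    # L2 reconstruction error: squared error summed over dimensions,
    # averaged over the batch (the reduction is an assumption).
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    # L2 norm of each feature's decoder vector: (n_features,).
    dec_norms = np.linalg.norm(W_dec, axis=-1)
    # L1 penalty, each activation weighted by its decoder norm.
    # Activations are non-negative after ReLU, so no abs() is needed.
    sparsity = np.mean(np.sum(f * dec_norms, axis=-1))
    return recon + l1_coeff * sparsity, recon, sparsity
```

Under these assumptions, weighting by the decoder norm makes the loss unchanged when the decoder rows are scaled up by a constant and the encoder weights and biases are scaled down by the same constant, so the penalty cannot be lowered by shrinking activations while growing decoder vectors.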
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- SAE training loss decreases as a power law with compute budget when using compute-optimal hyperparameters. (finding · 0.837 · From scaling laws sweep.)
- A promising property for interpretability analysis off-distribution.
- Systematic underestimation of feature activations degrades reconstruction and interpretability.
- Ethical implication about the nature of AI training experience if the thesis holds.
- Training stability analysis.
- Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Systematic underestimation of non-zero feature activations due to L1 sparsity penalty.