SAE training loss (MSE + L1 penalty with decoder norm scaling)

The objective function used to train the SAE: an L2 reconstruction error plus an L1 sparsity penalty on the feature activations, with each activation scaled by the norm of its decoder vector.
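As a rough illustration of the objective described above, here is a minimal NumPy sketch. It assumes a single-layer ReLU encoder and a linear decoder with a bias; the function name `sae_loss`, the parameter names, the weight layout, and the `l1_coeff` coefficient are all hypothetical and not taken from the source. Other formulations may reduce over the batch differently.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Sketch of an SAE loss: L2 reconstruction + decoder-norm-scaled L1.

    Assumed (hypothetical) shapes:
      x:     (batch, d_model)
      W_enc: (d_model, n_features), b_enc: (n_features,)
      W_dec: (n_features, d_model), b_dec: (d_model,)
    """
    # Assumption: ReLU encoder producing non-negative feature activations.
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    # Linear decoder reconstructs the input from the activations.
    x_hat = f @ W_dec + b_dec
    # Squared L2 reconstruction error per example, averaged over the batch.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # L2 norm of each feature's decoder vector (one row of W_dec per feature).
    dec_norms = np.linalg.norm(W_dec, axis=1)
    # L1 penalty: each activation is weighted by its decoder norm.
    sparsity = np.mean(np.sum(f * dec_norms, axis=1))
    return recon + l1_coeff * sparsity, recon, sparsity
```

Weighting each activation by its decoder norm means the penalty cannot be reduced simply by shrinking activations while growing the matching decoder vectors, since the product stays the same.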

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE training loss decreases as a power law with compute budget when using compute-optimal hyperparameters. · finding · 0.837
From scaling laws sweep.
SAE features generalize to images despite training only on text, indicating out-of-distribution robustness. · claim · 0.773
A promising property for interpretability analysis off-distribution.
Shrinkage from L1 penalty significantly harms sparse autoencoder performance. · claim · 0.773
Systematic underestimation of feature activations degrades reconstruction and interpretability.
Current training methods rely on loss minimization, meaning the experiential profile of training is predominantly negative across billions of parameter updates. · claim · 0.759
Ethical implication about the nature of AI training experience, if the thesis holds.
DB-MTL training losses decrease smoothly and gradient norms are lower than EW on NYUv2, indicating training stability. · finding · 0.759
Training stability analysis.
Feature attribution via gradient dot product with SAE decoder · method · 0.755
Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
Sparse Autoencoders (SAE) · method · 0.750
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Shrinkage (L1 penalty underestimation) · concept · 0.750
Systematic underestimation of non-zero feature activations due to L1 sparsity penalty.
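
The shrinkage effect named in the last entry can be seen in a one-feature toy case. The sketch below assumes a single feature with a unit-norm decoder direction and a scalar target `x`, so the objective reduces to `(x - a)**2 + l1_coeff * a` over activations `a >= 0`. The function name and this simplified setup are illustrative assumptions, not the source's formulation.

```python
import numpy as np

def l1_penalized_activation(x, l1_coeff):
    """Minimizer over a >= 0 of (x - a)**2 + l1_coeff * a.

    Setting the derivative -2 * (x - a) + l1_coeff to zero gives
    a = x - l1_coeff / 2, clipped at zero. Any non-zero activation
    therefore lands below the true value x by l1_coeff / 2.
    """
    return np.maximum(x - l1_coeff / 2.0, 0.0)

# Toy usage: a true magnitude of 1.0 is recovered as a smaller value,
# and a small magnitude is zeroed out entirely.
true_values = np.array([1.0, 0.1])
estimates = l1_penalized_activation(true_values, l1_coeff=0.4)
```

In this toy setting the underestimation is a constant offset set by the penalty coefficient. In a full SAE the effect also depends on the encoder, the decoder, and how features interact.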