concept
active
concept:l0-norm-of-feature-activations
L0 Norm of Feature Activations
Average number of nonzero feature entries per input; primary measure of activation sparsity in the autoencoder
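A minimal sketch of this definition, assuming the feature activations are available as a 2-D array with one row per input and one column per feature. The function name `mean_l0` and the optional threshold `eps` are illustrative choices, not taken from the source:

```python
import numpy as np


def mean_l0(activations: np.ndarray, eps: float = 0.0) -> float:
    """Average count of entries per input whose magnitude exceeds eps.

    activations: shape (n_inputs, n_features); layout is an assumption.
    eps: hypothetical threshold; 0.0 counts every strictly nonzero entry.
    """
    active = np.abs(activations) > eps       # boolean mask of active features
    return float(active.sum(axis=1).mean())  # per-input L0, then the average


# Toy example: three inputs, four features.
acts = np.array([
    [0.0, 1.2, 0.0, 0.0],  # 1 active feature
    [0.3, 0.0, 0.0, 2.1],  # 2 active features
    [0.0, 0.0, 0.0, 0.0],  # 0 active features
])
print(mean_l0(acts))  # 1.0
```

A lower value means sparser activations: on average, fewer features fire per input.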
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Feature Sparsity · associated_with · Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Clamping a feature's value to zero to measure its causal effect on model output.
- Assumption that small anchor changes can produce sharp performance shifts when conditions are favorable.
- Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.723 · Demonstrates universality of the Arabic script feature across two independently trained transformers
- Demonstrates that activation similarity can diverge from logit weight similarity due to interference
- Shows that interpretability correlates with activation strength; most of the effect on the model comes from high activations
- Universality of Hebrew script feature across two transformers
- Using Claude to search for features activating on specific concepts and automated labeling.
- Universality of base64 feature across two transformers