L0 Norm of Feature Activations

Average number of nonzero feature entries per input; primary measure of activation sparsity in the autoencoder
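
A minimal sketch of how this quantity could be computed, assuming the feature activations are available as a dense NumPy array of shape (n_inputs, n_features). The function name `mean_l0` and the tolerance parameter `eps` are illustrative assumptions, not names from the source.

```python
import numpy as np

def mean_l0(activations: np.ndarray, eps: float = 0.0) -> float:
    """Average number of nonzero feature entries per input.

    activations: array of shape (n_inputs, n_features).
    eps: hypothetical tolerance; entries with |a| <= eps count as zero.
    """
    active = np.abs(activations) > eps        # boolean mask of active features
    return float(active.sum(axis=1).mean())   # count per input, then average

# Toy example: three inputs, four features.
acts = np.array([
    [0.0, 1.2, 0.0, 0.0],   # 1 active feature
    [0.5, 0.0, 0.0, 2.0],   # 2 active features
    [0.0, 0.0, 0.0, 0.0],   # 0 active features
])
print(mean_l0(acts))  # 1.0
```

A lower value means sparser activations: on average, fewer features fire per input.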

Neighborhood — ranked by edge-count

concept

Feature Sparsity
associated_with
Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Feature ablation (zeroing feature activations) · method · 0.745
Clamping a feature's value to zero to measure its causal effect on model output.
Threshold-like activation assumption · concept · 0.729
Assumption that small anchor changes can produce sharp performance shifts when conditions are favorable.
Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.723
Demonstrates universality of the Arabic script feature across two independently trained transformers
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.718
Demonstrates that activation similarity can diverge from logit weight similarity due to interference
Higher-activating feature intervals are systematically more interpretable than lower-activating intervals in human analysis · finding · 0.715
Shows that interpretability correlates with activation strength and that most of the effect on the model comes from high activations
Hebrew feature A/1/416 and B/1/1901 have activation correlation of 0.92 · finding · 0.712
Universality of Hebrew script feature across two transformers
Feature completeness search using LLM-generated queries · method · 0.707
Using Claude to search for features that activate on specific concepts and to label them automatically.
Base64 feature A/1/2357 and B/1/2165 have activation correlation of 0.85 · finding · 0.707
Universality of base64 feature across two transformers
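
Several of the findings above report an activation correlation between a feature in one model and a feature in another. A minimal sketch of such a comparison, assuming "activation correlation" means the Pearson correlation of the two features' activations over the same token sequence. The source does not state which estimator was used, and the function name `activation_correlation` is hypothetical.

```python
import numpy as np

def activation_correlation(a, b) -> float:
    """Pearson correlation between two features' activations.

    a, b: 1-D sequences of equal length, holding each feature's activation
    on the same tokens in the same order (an assumption of this sketch).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two 1-D arrays of equal length")
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
    return float(np.corrcoef(a, b)[0, 1])

# Toy activations for two features that fire on the same tokens at different scales.
feat_a = [0.0, 0.0, 1.0, 2.0, 0.0]
feat_b = [0.0, 0.0, 2.0, 4.0, 0.0]
r = activation_correlation(feat_a, feat_b)  # close to 1.0 for proportional activations
```

Because Pearson correlation is invariant to scale, two features can correlate strongly even when their activation magnitudes differ. The result is undefined (NaN) if either feature never varies on the sampled tokens.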