Feature Sparsity

Property that each feature activates on only a small fraction of inputs; this sparsity enables compressed sensing and is what allows superposition to work

Neighborhood — ranked by edge-count

framework

Superposition Hypothesis
supports
Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition

concept

L0 Norm of Feature Activations
associated_with
Average number of nonzero feature entries per input; primary measure of activation sparsity in the autoencoder
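As a minimal sketch of the measure described above, the average L0 can be computed by counting nonzero feature entries per input and averaging over inputs. The function name `mean_l0`, the `tol` threshold, and the toy activation matrix are illustrative assumptions, not part of any particular autoencoder codebase.

```python
import numpy as np

def mean_l0(activations, tol=0.0):
    """Average number of nonzero feature entries per input.

    activations: array of shape (n_inputs, n_features).
    tol: hypothetical threshold; entries with |value| <= tol count as zero.
    """
    acts = np.asarray(activations)
    nonzero = np.abs(acts) > tol          # boolean mask of active features
    return float(nonzero.sum(axis=1).mean())

# Toy activations: 3 inputs, 4 features (made-up values for illustration).
acts = np.array([
    [0.0, 1.2, 0.0, 0.0],   # 1 active feature
    [0.5, 0.0, 0.0, 2.0],   # 2 active features
    [0.0, 0.0, 0.0, 0.0],   # 0 active features
])
l0 = mean_l0(acts)
print(l0)  # 1.0
```

A lower value means sparser activations; here the three inputs have 1, 2, and 0 active features, so the mean is 1.0.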

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Feature Dictionary · concept · 0.814
The set of sparse, interpretable features extracted from model embeddings via sparse autoencoders (SAEs)
Superposition of Sparse Features · concept · 0.784
Mechanistic finding by Bricken et al. 2023 about how LLMs store features; cited as operational justification for pattern-repository assumption
Feature splitting · concept · 0.782
Phenomenon where a feature in a small SAE splits into multiple finer features in a larger SAE.
Feature Visualization · method · 0.780
Method of optimizing an input so that a neuron fires maximally, used to characterize what the neuron detects; establishes a causal link between input and activation
Feature Universality · concept · 0.779
Property of features that form consistently across different models trained on the same or similar data, suggesting features are real representational units
Feature Looping · concept · 0.774
Repetitive behavioral pattern observed under high steering strengths in SAE feature self-steering experiments
Pure Feature · concept · 0.774
A feature that responds to only a single latent variable, contrasted with polysemantic features
Feature co-occurrence · concept · 0.768
Patterns of which features activate together across tokens; preserved by covariance pooling but lost in mean pooling.
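The contrast between the two pooling schemes in the last entry can be illustrated with a small sketch. The two toy token-by-feature matrices below are made-up values chosen so that mean pooling gives identical vectors while covariance pooling differs; the helper names `mean_pool` and `cov_pool` are hypothetical.

```python
import numpy as np

def mean_pool(acts):
    """Average feature activations over tokens: shape (n_features,)."""
    return np.asarray(acts).mean(axis=0)

def cov_pool(acts):
    """Feature-by-feature covariance over tokens: shape (n_features, n_features)."""
    return np.cov(np.asarray(acts), rowvar=False, bias=True)

# Rows are tokens, columns are two features.
together = np.array([[1.0, 1.0],    # both features fire on the same token
                     [0.0, 0.0]])
apart = np.array([[1.0, 0.0],       # the features fire on different tokens
                  [0.0, 1.0]])

mean_together, mean_apart = mean_pool(together), mean_pool(apart)
cov_together, cov_apart = cov_pool(together), cov_pool(apart)

print(mean_together, mean_apart)             # identical: [0.5 0.5] [0.5 0.5]
print(cov_together[0, 1], cov_apart[0, 1])   # 0.25 -0.25
```

Both sequences have the same mean-pooled vector, so mean pooling cannot tell them apart; the off-diagonal covariance entry changes sign, which is the co-occurrence signal that covariance pooling retains.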