concept
active
concept:feature-sparsity · Feature Sparsity
Property that each feature activates on only a small fraction of inputs; this enables compressed-sensing-style recovery and is what allows superposition to work
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Superposition Hypothesis · supports · Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
Concepts (1)
concept
- L0 Norm of Feature Activations · associated_with · Average number of nonzero feature entries per input; primary measure of activation sparsity in the autoencoder
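The L0 measure above can be sketched in a few lines. This is a minimal illustration, not the autoencoder's actual code: the function names `mean_l0` and `activation_frequency`, the `eps` threshold, and the toy activation matrix are all assumptions made for the example.

```python
from typing import List, Sequence


def mean_l0(activations: Sequence[Sequence[float]], eps: float = 0.0) -> float:
    """Average number of entries with |value| > eps per input (row)."""
    if not activations:
        raise ValueError("need at least one input")
    counts = [sum(1 for v in row if abs(v) > eps) for row in activations]
    return sum(counts) / len(counts)


def activation_frequency(
    activations: Sequence[Sequence[float]], eps: float = 0.0
) -> List[float]:
    """Fraction of inputs on which each feature (column) is active."""
    n = len(activations)
    width = len(activations[0])
    return [
        sum(1 for row in activations if abs(row[j]) > eps) / n
        for j in range(width)
    ]


# Hypothetical toy data: 4 inputs x 6 features, mostly zeros.
acts = [
    [0.0, 1.2, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0, 0.0, 2.1],
]

print(mean_l0(acts))               # 1.25 active features per input
print(activation_frequency(acts))  # per-feature firing rate
```

A low mean L0 relative to the number of features, together with low per-feature firing rates, is what "sparse" means in this entry.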
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The extracted set of sparse interpretable features from model embeddings via SAEs
- Mechanistic finding by Bricken et al. 2023 about how LLMs store features; cited as operational justification for pattern-repository assumption
- Phenomenon where a feature in a small SAE splits into multiple finer features in a larger SAE.
- Method of optimizing input to cause a neuron to fire maximally, used to characterize what a neuron detects; establishes causal link
- Property of features that form consistently across different models trained on the same or similar data, suggesting features are real representational units
- Repetitive behavioral pattern observed under high steering strengths in SAE feature self-steering experiments
- A feature that responds to only a single latent variable, contrasted with polysemantic features
- Patterns of which features activate together across tokens; preserved by covariance pooling but lost in mean pooling.
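The last item's claim, that co-activation survives covariance pooling but not mean pooling, can be checked on a toy case. This is a sketch under stated assumptions: `mean_pool` and `covariance_pool` are hypothetical helper names, the covariance uses the population form (divide by the token count), and the two token sequences are invented for illustration.

```python
from typing import List, Sequence


def mean_pool(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Per-feature mean across tokens."""
    n = len(tokens)
    d = len(tokens[0])
    return [sum(t[j] for t in tokens) / n for j in range(d)]


def covariance_pool(tokens: Sequence[Sequence[float]]) -> List[List[float]]:
    """Population covariance matrix of features across tokens."""
    n = len(tokens)
    mu = mean_pool(tokens)
    d = len(mu)
    return [
        [sum((t[i] - mu[i]) * (t[j] - mu[j]) for t in tokens) / n for j in range(d)]
        for i in range(d)
    ]


# Two hypothetical sequences of 2 tokens x 2 features.
together = [[1.0, 1.0], [0.0, 0.0]]  # features fire on the same token
apart = [[1.0, 0.0], [0.0, 1.0]]     # features fire on different tokens

print(mean_pool(together), mean_pool(apart))  # identical means
print(covariance_pool(together)[0][1])        # positive off-diagonal
print(covariance_pool(apart)[0][1])           # negative off-diagonal
```

Both sequences have the same mean-pooled vector, so mean pooling cannot tell them apart; the off-diagonal covariance entry differs in sign, which is the co-activation information the entry refers to.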