concept
active
concept:superposition-of-sparse-features
Superposition of Sparse Features
Mechanistic finding by Bricken et al. (2023) about how LLMs store features; cited as operational justification for the pattern-repository assumption.
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Unlabeled statistical regularities stored during pretraining.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The extracted set of sparse interpretable features from model embeddings via SAEs
- Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
- The phenomenon where the residual stream communicates many more features than its dimensionality by encoding information across overlapping subspaces
- Theoretical model of how neural networks encode more features than dimensions, informing linear representation work.
- Specific phrases or sequences memorized via binary features in superposition, enabling narrow pattern matching despite few neurons
- Coding scheme where qualities are represented by few neurons with continuous similarity relations.
- The state in which a dialogue agent maintains multiple possible characters simultaneously, refined as the conversation proceeds
- Phenomenon where models represent more features than dimensions via almost-orthogonal directions.
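Several of the neighbors above describe the same mechanism: more features than dimensions, stored along almost-orthogonal directions, and readable because only a few features are active at once. The toy sketch below illustrates that idea with NumPy. It is not the procedure of Bricken et al.; the sizes `d`, `n`, `k`, the random feature directions `W`, and the simple dot-product readout are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n features packed into d < n dimensions,
# with only k features active on any given input (sparsity).
d, n, k = 512, 1024, 3

# Assumption: feature directions are random unit vectors. In high
# dimensions, random directions are almost orthogonal to one another.
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Largest absolute cosine similarity between two distinct features.
gram = W @ W.T
np.fill_diagonal(gram, 0.0)
max_overlap = float(np.abs(gram).max())

# A sparse input: k randomly chosen features switched on.
active = np.sort(rng.choice(n, size=k, replace=False))
a = np.zeros(n)
a[active] = 1.0

# Superposition: the d-dimensional vector is the sum of active directions.
x = a @ W

# Linear readout: project x back onto every feature direction.
# Active features score near 1; inactive ones only pick up interference.
readout = W @ x
recovered = np.sort(np.argsort(readout)[-k:])

print("max overlap between distinct features:", max_overlap)
print("active features:   ", active)
print("recovered features:", recovered)
```

Raising `k` toward `n` in this sketch makes the interference terms grow until the top-`k` readout stops matching the active set, which is the sense in which sparsity is what allows superposition to work.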