concept
active
concept:sparse-feature-dictionary
Sparse Feature Dictionary
The set of sparse, interpretable features extracted from model embeddings via sparse autoencoders (SAEs)
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- TopK Sparse Autoencoders · introduces · The central mechanistic interpretability tool, applied across all three EEG transformers to extract sparse feature dictionaries
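As a rough illustration of how a TopK sparse autoencoder yields a sparse feature dictionary, here is a minimal, dependency-free sketch of a single forward pass. It is not taken from the source: the function name `topk_sae_forward`, the weight layout, the ReLU applied after the top-k selection, and the subtraction of the decoder bias before encoding are all assumptions chosen for the example.

```python
def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a TopK sparse autoencoder on a single embedding.

    Hypothetical layout (an assumption, not from the source):
      x:     list of d floats (the model embedding)
      W_enc: n x d encoder weights, n > d for an overcomplete dictionary
      W_dec: d x n decoder weights; column j is the direction of feature j
    """
    # Pre-activations: one score per dictionary feature.
    pre = [
        sum(w * (xi - bd) for w, xi, bd in zip(row, x, b_dec)) + be
        for row, be in zip(W_enc, b_enc)
    ]
    # Keep only the k largest pre-activations; every other feature is zeroed.
    keep = sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)[:k]
    z = [0.0] * len(pre)
    for j in keep:
        z[j] = max(pre[j], 0.0)  # ReLU (assumed) keeps activations non-negative
    # Reconstruct the embedding as a sparse combination of decoder columns.
    x_hat = [bd + sum(row[j] * z[j] for j in keep) for row, bd in zip(W_dec, b_dec)]
    return z, x_hat


# Toy example: 2-dimensional embedding, 3 dictionary features, k = 1.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_dec = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
z, x_hat = topk_sae_forward([1.0, 2.0], W_enc, [0.0] * 3, W_dec, [0.0] * 2, k=1)
```

In the toy example only the third feature survives the top-k selection, so `z` has a single nonzero entry. The sparse vector `z` is what a dictionary entry indexes into; training the weights is outside the scope of this sketch.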
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation
- Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
- Mechanistic finding by Bricken et al. 2023 about how LLMs store features; cited as operational justification for pattern-repository assumption
- Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
- Used in the Anthropic welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle
- Fraction of training tokens on which a given feature has nonzero activation; used as proxy metric for autoencoder quality
- Coding scheme where qualities are represented by few neurons with continuous similarity relations
- Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
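Two of the descriptions above are concrete enough to sketch in code: the per-feature activation frequency (the fraction of tokens on which a feature is nonzero) and an autoencoder objective with an L1 sparsity penalty. The snippet below is a minimal sketch under assumptions: the function names, the squared-error reconstruction term, and the `l1_coeff` parameter are illustrative and not taken from any of the listed entities.

```python
def activation_frequency(activations):
    """Fraction of tokens on which each feature has a nonzero activation.

    activations: list of per-token feature vectors, each a list of n floats.
    """
    n_tokens = len(activations)
    counts = [0] * len(activations[0])
    for z in activations:
        for j, v in enumerate(z):
            if v != 0.0:
                counts[j] += 1
    return [c / n_tokens for c in counts]


def l1_sae_loss(x, x_hat, z, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in z)
    return reconstruction + l1_coeff * sparsity


# Four tokens, three features; the third feature never fires.
acts = [
    [0.0, 1.2, 0.0],
    [0.0, 0.4, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.1, 0.0],
]
freqs = activation_frequency(acts)
loss = l1_sae_loss([1.0, 2.0], [1.0, 1.0], [0.5, 0.0, 1.5], l1_coeff=0.5)
```

Here `freqs` is `[0.25, 0.75, 0.0]` and `loss` is `2.0` (reconstruction error 1.0 plus 0.5 times an L1 norm of 2.0). A feature with frequency 0 over a large sample is usually treated as dead, while one that fires on most tokens is unlikely to be sparse in the intended sense.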