method
active
method:sparse-dictionary-learning
Sparse Dictionary Learning
General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
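A minimal sketch of the sparse autoencoder described above, assuming a ReLU hidden layer, a squared-error reconstruction term, and an L1 penalty on the hidden activations. The function names (init_sae, sae_forward, sae_loss), the initialization scale, and the l1_coeff value are hypothetical choices for illustration, not details taken from the source.

```python
import numpy as np

def init_sae(d_model, d_dict, seed=0):
    """Initialize a one-hidden-layer autoencoder (assumed shapes and scales)."""
    rng = np.random.default_rng(seed)
    w_enc = rng.normal(0.0, 0.1, size=(d_model, d_dict))
    w_dec = rng.normal(0.0, 0.1, size=(d_dict, d_model))
    # Unit-norm dictionary rows: a common convention, assumed here.
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "w_enc": w_enc,
        "b_enc": np.zeros(d_dict),
        "w_dec": w_dec,
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x):
    """Encode activations x into non-negative feature coefficients, then decode."""
    features = np.maximum(0.0, x @ params["w_enc"] + params["b_enc"])  # ReLU
    x_hat = features @ params["w_dec"] + params["b_dec"]
    return features, x_hat

def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    features, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return recon + l1_coeff * sparsity

# Usage: an overcomplete dictionary has more features than input dimensions.
params = init_sae(d_model=8, d_dict=32)
batch = np.random.default_rng(1).normal(size=(4, 8))
features, reconstruction = sae_forward(params, batch)
loss = sae_loss(params, batch)
```

Here the dictionary is overcomplete because d_dict (32) exceeds d_model (8); the L1 term pushes most feature coefficients toward zero so that each input is explained by a few dictionary rows. Training (gradient updates on this loss) is omitted from the sketch.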
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- The extracted set of sparse interpretable features from model embeddings via SAEs
- Motivation for using sparsity-based dictionary learning on language models
- Bricken et al.'s method for decomposing language models into interpretable features; cited as AI alignment interpretability relevant to consciousness detection
- Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
- Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
- Inference of parameters encoding contingencies of the world (e.g., likelihood matrix A) at slower timescale than perception.
- Coding scheme where qualities are represented by few neurons with continuous similarity relations.
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations