framework
active
framework:sparse-autoencoder-for-dictionary-learning
Sparse Autoencoder for Dictionary Learning
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries.
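As a rough illustration of the objective described above, the following is a minimal NumPy sketch of a one-hidden-layer autoencoder loss with an L1 penalty on the hidden activations. The ReLU nonlinearity, the squared-error reconstruction term, the function name `sae_loss`, and all shapes and coefficients are assumptions made for the example, not details confirmed by this entry.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Reconstruction + L1 loss for a one-hidden-layer sparse autoencoder.

    x has shape (batch, d_model). The hidden width n_features is larger
    than d_model, which is what makes the dictionary overcomplete.
    """
    # Hidden feature activations (ReLU is an assumption of this sketch).
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Reconstruction as a weighted sum of dictionary rows of W_dec.
    x_hat = f @ W_dec + b_dec
    # Mean squared reconstruction error per example.
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # L1 penalty on activations encourages few active features per input.
    l1 = np.mean(np.sum(np.abs(f), axis=1))
    return mse + l1_coeff * l1, f, x_hat

# Hypothetical sizes: 8-dimensional activations, 32 dictionary features.
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

loss, f, x_hat = sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3)
```

In practice the parameters would be trained by gradient descent on this loss; the sketch only evaluates it once to show how the two terms combine.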
Neighborhood — ranked by edge-count
Thinkers (3)
thinker
- Trenton Bricken · studies · Toy models of superposition.
- Lee Sharkey · studies · Co-author of VPD paper; mechanistic interpretability researcher affiliated with Goodfire.
- Hoagy Cunningham · studies · Co-author; de-risked residual-stream SAEs, ran feature completeness analysis.
Methods (1)
method
- Sparse Dictionary Learning · implements · General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation.
Concepts (2)
concept
- Pre-Encoder Bias · associated_with · Architectural modification that subtracts a learned bias from autoencoder inputs before encoding; initialized to the geometric median of the dataset; improves autoencoder performance.
- Untied Decoder Weights · associated_with · Autoencoder design choice to learn separate encoder and decoder weights, increasing representational capacity by allowing encoder vectors to distinguish similar features.
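The two design choices above can be sketched together as follows. This is an illustrative NumPy example under stated assumptions: the geometric median is approximated with Weiszfeld iterations (this entry does not say how it is computed), the bias is added back after decoding (also an assumption), and the names `geometric_median`, `encode`, `decode`, and `b_pre` are hypothetical.

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-8):
    """Approximate the geometric median of the rows of X.

    Uses Weiszfeld iterations, an assumption of this sketch: each step
    re-weights points by the inverse of their distance to the estimate.
    """
    m = X.mean(axis=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(dist, eps)  # clamp to avoid division by zero
        m = (w[:, None] * X).sum(axis=0) / w.sum()
    return m

def encode(x, W_enc, b_enc, b_pre):
    # Subtract the pre-encoder bias before applying the encoder weights.
    return np.maximum(0.0, (x - b_pre) @ W_enc + b_enc)

def decode(f, W_dec, b_pre):
    # Untied weights: W_dec is its own parameter, not the transpose of W_enc.
    # Adding b_pre back here is an assumption of this sketch.
    return f @ W_dec + b_pre

# Hypothetical data and sizes for illustration only.
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
data = rng.normal(loc=3.0, size=(256, d_model))

b_pre = geometric_median(data)  # initialization; learned afterwards
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))  # separate from W_enc
b_enc = np.zeros(n_features)

features = encode(data, W_enc, b_enc, b_pre)
recon = decode(features, W_dec, b_pre)
```

The geometric median is less sensitive to outlying activations than the mean, which is one plausible reason to prefer it as a centering point.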
Claims (1)
claim
- Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Frameworks (3)
framework
- Sparse Autoencoder · related_to · Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence.
- Superposition Hypothesis · implements · Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition.
- Compressed Sensing · extends · Mathematical framework enabling recovery of high-dimensional sparse vectors from low-dimensional projections; theoretically underpins the sparse autoencoder approach.
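To make the compressed sensing idea above concrete, here is a small NumPy sketch that attempts to recover a sparse high-dimensional vector from a lower-dimensional random projection. It uses ISTA (iterative soft-thresholding) on an L1-regularized least-squares problem, which is an illustrative choice and not a method attributed to the paper; the dimensions, the penalty `lam`, and all names are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Shrink each entry toward zero by t; entries within [-t, t] become 0.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.01, n_iter=500):
    """Minimize 0.5 * ||A s - y||^2 + lam * ||s||_1 with ISTA."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ s - y)
        s = soft_threshold(s - grad / L, lam / L)
    return s

def objective(A, y, s, lam=0.01):
    return 0.5 * np.sum((A @ s - y) ** 2) + lam * np.sum(np.abs(s))

# Hypothetical setup: a 100-dimensional vector with 3 nonzero entries,
# observed through a 40-dimensional random projection.
rng = np.random.default_rng(0)
n, m = 100, 40
s_true = np.zeros(n)
s_true[rng.choice(n, size=3, replace=False)] = [1.0, -2.0, 1.5]
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = A @ s_true

s_hat = ista(A, y)
```

A sparse autoencoder can be read as replacing this per-input iterative solve with a single learned encoder pass, trading exactness for speed.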
Artifacts (1)
artifact
- Interactive interface for exploring all 90 learned dictionaries' features, including activating examples, logit effects, and ablations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Empirical principle discovered during autoencoder training; led to using 8 billion training points
- The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.805 · Core methodology paper for SAE-based interpretable feature extraction
- The extracted set of sparse interpretable features from model embeddings via SAEs
- SAEs trained on 100M+ tokens to compress token layer-40 activations into 64 active features out of 100K+ for interpretability analysis
- Central claim of the paper: the method scales to state-of-the-art transformers.