claim
active
claim:sparse-autoencoders-are-preferable-to-stronger-iterative-dictionary-learning-methods-because-they-cannot-recover-features-the-model-itself-cannot-access
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Source paper
extracted_from (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
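
The framework described above (a one-hidden-layer MLP with an L1 sparsity penalty, decomposing activations into an overcomplete dictionary) can be sketched roughly as follows. This is a minimal illustration, not the source paper's implementation: the ReLU nonlinearity, the tied initialization, the dimensions, and the names `init_sae`, `sae_forward`, `sae_loss`, and `l1_coeff` are all assumptions made for the example. It shows only the forward pass and the training objective, with no optimizer.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_sae(d_model, d_dict, seed=0):
    """Build hypothetical SAE parameters; d_dict > d_model makes the dictionary overcomplete."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_model)
    # Encoder: one row per dictionary feature (d_dict x d_model).
    W_enc = [[rng.uniform(-scale, scale) for _ in range(d_model)]
             for _ in range(d_dict)]
    # Decoder starts as the encoder's transpose (d_model x d_dict); an assumed choice.
    W_dec = [[W_enc[j][i] for j in range(d_dict)] for i in range(d_model)]
    return W_enc, [0.0] * d_dict, W_dec, [0.0] * d_model


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activation x into non-negative features f, then reconstruct x_hat."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU is assumed here
    x_hat = [a + b for a, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the feature activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(a) for a in f)
    return mse + l1_coeff * l1


if __name__ == "__main__":
    params = init_sae(d_model=4, d_dict=16)
    x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one model activation vector
    f, x_hat = sae_forward(x, *params)
    print(len(f), sae_loss(x, f, x_hat, l1_coeff=0.1))
```

Because encoding is a single matrix multiply followed by a nonlinearity, the features it finds are limited to what such a shallow readout can extract, which is the property the claim relies on when it prefers this encoder over iterative sparse-coding solvers.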
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Critique of activation-based interpretability methods.
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Forward-looking prediction about scalability of the method to larger models
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.852 · Core methodology paper for SAE-based interpretable feature extraction
- Empirical principle discovered during autoencoder training; led to using 8 billion training points
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
- Motivation for using sparsity-based dictionary learning on language models