claim
active
claim:sparse-autoencoders-don-t-provide-a-comprehensive-solution-because-they-decode-activations-not-parameters
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters
Critique of activation-based interpretability methods.
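To make the activation/parameter distinction concrete, here is a minimal sketch assuming the standard SAE formulation (ReLU encoder, linear decoder, reconstruction loss plus an L1 sparsity penalty). The class `TinySAE`, its parameter names, and the example vector are hypothetical illustrations, not taken from the source paper. Note that the only input is an activation vector; the weights of the model being interpreted never appear.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Hypothetical minimal sparse autoencoder (forward pass and loss only)."""

    def __init__(self, d_act, d_feat, seed=0):
        rng = random.Random(seed)
        # Encoder maps an activation (d_act) to an overcomplete feature space (d_feat).
        self.enc = [[rng.gauss(0, 0.1) for _ in range(d_act)] for _ in range(d_feat)]
        self.b_enc = [0.0] * d_feat
        # Decoder maps sparse features back to activation space.
        self.dec = [[rng.gauss(0, 0.1) for _ in range(d_feat)] for _ in range(d_act)]

    def encode(self, activation):
        pre = matvec(self.enc, activation)
        # ReLU keeps feature activations non-negative.
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        return matvec(self.dec, features)

    def loss(self, activation, l1_coeff=1e-3):
        features = self.encode(activation)
        recon = self.decode(features)
        mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
        l1 = sum(abs(f) for f in features)  # sparsity penalty on features
        return mse + l1_coeff * l1


# Assumed stand-in for one activation vector read from a model layer.
activation = [1.0, -0.5, 0.2, 0.0]
sae = TinySAE(d_act=4, d_feat=8)
features = sae.encode(activation)
reconstruction = sae.decode(features)
```

Everything the SAE learns is a dictionary over activation vectors like `activation` above, which is the gap the claim points at: the decomposition describes what the model computes on given inputs, not the parameters that produce those computations.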
Source paper
extracted_from
Neighborhood (ranked by edge count)
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Theoretical and empirical analysis of why autoregressive (AR) language models cannot maintain coherence or convergence beyond their context window through local interactions alone.
- Critiques of SAEs for mechanistic interpretability, focusing on activation vs. parameter decoding gaps.
Methods (1)
method
- Sparse Autoencoders (SAE) · associated_with · Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.869 · Core methodology paper for SAE-based interpretable feature extraction
- Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
- Forward-looking prediction about scalability of the method to larger models
- Empirical principle discovered during autoencoder training; led to using 8 billion training points
- Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Used in Anthropic welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle
Cross-corpus bridges (1)
same_concept_as · Nomic cosine
External markdown files that talk about the same concept as this entity.
- aboutblank_kb · Autoencoder Architecture · frameworks/variational-autoencoder-architecture.md · 0.804