claim
active
claim:sparse-autoencoders-extract-features-that-are-significantly-more-monosemantic-than-neurons-as-shown-by-four-independent-lines-of-evidence
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Findings (7)
finding
- Demonstrates that the Arabic feature is not aligned to any single neuron
- Demonstrates activation specificity of the Arabic script sparse autoencoder feature
- Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
- Automated interpretability analysis of activations confirms features are more interpretable than neurons
- Shows that interpretability correlates with activation strength; most of the model effect comes from high activations
- Human analysis showing features are substantially more interpretable than neurons
- Hebrew feature is effectively invisible in the neuron basis
Hypotheses (1)
hypothesis
- Forward-looking prediction about scalability of the method to larger models
Questions (1)
question
- Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.845 · Core methodology paper for SAE-based interpretable feature extraction
- Critique of activation-based interpretability methods.
- Empirical principle discovered during autoencoder training; led to using 8 billion training points
- Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
- Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
- Foundational empirical result enabling all downstream analysis
- Quantitative comparison supporting SAE utility.
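
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is only an illustration under stated assumptions: the page does not say how embeddings are computed or stored, so the function names, the toy vectors, and the edge representation here are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """List entities whose similarity to `target` meets the threshold
    but which have no typed edge to it, ranked by descending score."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked by a typed relation.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: 2-d vectors standing in for real entity embeddings.
embeddings = {
    "claim": [1.0, 0.0],
    "concept": [0.8, 0.6],    # similar, no typed edge: should be listed
    "finding": [0.9, 0.1],    # similar, but already linked: should be skipped
    "unrelated": [0.0, 1.0],  # orthogonal: below threshold
}
typed_edges = {frozenset(("claim", "finding"))}

related = similar_without_edge("claim", embeddings, typed_edges)
```

Entities returned this way are the candidates the page describes: either a typed edge is missing or the entry is an unrecognized duplicate.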