finding
active
finding:a-5-autoencoder-131-072-features-recovers-94-5-of-mlp-log-likelihood-loss-reduction
A/5 autoencoder (131,072 features) recovers 94.5% of MLP log-likelihood loss reduction
Shows that loss recovery increases with autoencoder size
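A minimal sketch of how a "loss recovered" figure like the 94.5% above could be computed. The entry does not spell out the formula, so this assumes the common definition: the MLP is replaced by the autoencoder's reconstruction, and the resulting loss is compared against a clean run and a run with the MLP zero-ablated. The function name and the loss values are hypothetical, chosen only so the ratio comes out near 0.945.

```python
def fraction_loss_recovered(loss_clean: float,
                            loss_ablated: float,
                            loss_reconstructed: float) -> float:
    """Share of the MLP's loss reduction kept by the autoencoder.

    loss_clean:         model loss with the real MLP output (assumed baseline)
    loss_ablated:       loss with the MLP output zero-ablated (assumed baseline)
    loss_reconstructed: loss with the MLP output replaced by the
                        autoencoder's reconstruction
    """
    gap = loss_ablated - loss_clean
    if gap <= 0:
        raise ValueError("ablating the MLP must increase the loss")
    return (loss_ablated - loss_reconstructed) / gap


# Hypothetical loss values, for illustration only.
recovered = fraction_loss_recovered(loss_clean=3.0,
                                    loss_ablated=5.0,
                                    loss_reconstructed=3.11)
print(f"{recovered:.1%}")
```

Under this definition, a value of 1.0 means the reconstruction is as good as the real MLP and 0.0 means it is no better than removing the MLP entirely.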
Source paper
extracted_from (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Forward-looking prediction about scalability of the method to larger models
Questions (1)
question
- Question about completeness of feature-based model explanation
Findings (1)
finding
- Measures how much of the MLP layer's function is explained by the learned features
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- 512-neuron MLP continues to yield new features as autoencoder scales to 131,072 features (256× expansion) · finding · 0.787 · Shows that superposition enables many more features than neurons
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.757 · Core methodology paper for SAE-based interpretable feature extraction
- Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.753 · Automated interpretability of logit weights confirms that feature downstream effects are more interpretable than neuron effects
- Core insight: reconstruction objective combined with appropriate initialization and KL regularization produces human-interpretable explanations as emergent property.
- SAE features are not simply mirroring individual neurons.
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
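The "related by similarity" selection above can be sketched as a filter over entity embeddings: keep entities whose cosine similarity to this one is at least 0.65 and that have no typed edge to it. This is only an illustration. The embedding vectors, the entity names, and the `untyped_neighbors` helper are hypothetical, and the page does not say how the scores are actually produced.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed relation to it.

    embeddings:  dict mapping entity name -> vector (hypothetical data)
    typed_edges: set of entity names already linked to `target`
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy vectors, for illustration only.
embeddings = {
    "this-finding": [1.0, 0.0],
    "near-duplicate": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
    "already-linked": [1.0, 0.0],
}
for name, score in untyped_neighbors("this-finding", embeddings,
                                     typed_edges={"already-linked"}):
    print(f"{name}: {score:.3f}")
```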
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.