method
active
method:sparse-autoencoder-training-on-layer-40-activations
Sparse Autoencoder Training on Layer-40 Activations
Sparse autoencoders (SAEs) trained on 100M+ tokens to compress each token's layer-40 activations into 64 active features, drawn from a dictionary of 100K+, for interpretability analysis
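The description above fixes the number of active features per token (64) without saying how that sparsity is enforced. One common way to get a fixed count is a top-k activation; the sketch below assumes that mechanism, which the source does not confirm. The function name, weight layout (one row per dictionary feature), and the toy sizes are all hypothetical and chosen only so the example runs quickly.

```python
import random


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode one activation vector, keep the k largest features, decode.

    Assumed layout (hypothetical): W_enc and W_dec each hold one row per
    dictionary feature, and each row has the same length as x.
    """
    # Pre-activation for every dictionary feature.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    # Indices of the k largest pre-activations.
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    # Sparse code: ReLU on the kept features, zero everywhere else.
    codes = [0.0] * len(pre)
    for i in top:
        codes[i] = max(pre[i], 0.0)
    # Reconstruction: decoder bias plus the weighted active feature rows.
    recon = list(b_dec)
    for i in top:
        for j in range(len(recon)):
            recon[j] += codes[i] * W_dec[i][j]
    return codes, recon


# Toy sizes for illustration; the entry itself describes 64 active
# features out of a dictionary of 100K+.
random.seed(0)
d_model, n_features, k = 8, 32, 4
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [row[:] for row in W_enc]  # tied initialisation, an assumption
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
n_active = sum(1 for c in codes if c != 0.0)
print(n_active, "active features out of", n_features)
```

Because the ReLU is applied after the top-k selection, the number of nonzero features is at most k rather than exactly k; a kept feature with a negative pre-activation is zeroed.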
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Sparse Autoencoder · implements · Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Concepts (1)
concept
- The specific neural network layer from which activations are extracted for probe construction and SAE training in the target models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Critique of activation-based interpretability methods.
- Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
- Empirical principle discovered during autoencoder training; led to using 8 billion training points
- Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
- Standard interpretability approach that VPD critiques and proposes an alternative to.
- The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
- Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
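One of the related entries above describes a one-hidden-layer MLP trained with an L1 sparsity penalty. As a rough illustration of that objective, the sketch below computes a reconstruction error plus an L1 term on the hidden code for a single activation vector. The function name, the weight layout, the mean-squared-error form of the reconstruction term, and the coefficient value are assumptions made for the example, not details taken from any of the listed sources.

```python
def sae_l1_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (total, mse, l1) for one activation vector x.

    Assumed layout (hypothetical): one row per dictionary feature in
    both W_enc and W_dec, each row the same length as x.
    """
    # Hidden layer: ReLU of an affine map, one unit per feature.
    hidden = [max(sum(w * xi for w, xi in zip(row, x)) + b, 0.0)
              for row, b in zip(W_enc, b_enc)]
    # Linear decoder back to the activation space.
    recon = [b_dec[j] + sum(h * W_dec[i][j] for i, h in enumerate(hidden))
             for j in range(len(x))]
    # Reconstruction term: mean squared error over dimensions.
    mse = sum((r - xi) ** 2 for r, xi in zip(recon, x)) / len(x)
    # Sparsity term: L1 norm of the hidden code.
    l1 = sum(abs(h) for h in hidden)
    return mse + l1_coeff * l1, mse, l1


# Tiny hand-checkable example: 2 input dimensions, 3 features.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

total, mse, l1 = sae_l1_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=0.1)
print(total, mse, l1)
```

In this example only the first feature fires (pre-activations are 1, -2, and -1), so the reconstruction is [1, 0], the squared error falls entirely on the second dimension, and the L1 term equals the single active value.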