framework
active
framework:topk-sparse-autoencoders
TopK Sparse Autoencoders
The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (4)
concept
- Entanglement · implements · Less hierarchical than embedment; multiple texts work into and out of each other, creating associations across levels and connecting any single text to the matrix of all others.
- monosemanticity · implements · Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
- EEG Transformer Embeddings · implements · The internal representations of EEG transformers from which SAE features are extracted
- Sparse Feature Dictionary · introduces · The extracted set of sparse interpretable features from model embeddings via SAEs
Frameworks (1)
framework
- Sparse Autoencoder · related_to · Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
- Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
- Neural network architecture that learns compressed representations; SOHMs are functionally equivalent.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.802 · Core methodology paper for SAE-based interpretable feature extraction
- Central claim of the paper: the method scales to state-of-the-art transformers.
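
For orientation, the forward pass of a TopK sparse autoencoder can be sketched as below. This is a minimal illustration and not the implementation used in the linked paper: the function name `topk_sae_forward`, the layer sizes, the random weights, and the tied initialisation are all assumptions made for the example. Where the L1-penalised variant listed above encourages sparsity through a loss term, the TopK variant enforces it directly by keeping only the `k` largest latent pre-activations per input and zeroing the rest.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a TopK sparse autoencoder (illustrative sketch).

    x:     (batch, d_model) activations to decompose
    W_enc: (d_model, n_features) encoder weights
    W_dec: (n_features, d_model) decoder weights (the feature dictionary)
    Returns (sparse codes, reconstruction).
    """
    # Pre-activations for every dictionary feature.
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    # Keep only those k entries (passed through ReLU); all others stay zero.
    codes = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    codes[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    # A linear decoder reconstructs the input from the sparse codes.
    recon = codes @ W_dec + b_dec
    return codes, recon

# Hypothetical sizes: the dictionary is overcomplete because 64 > 16.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = W_enc.T.copy()  # tied initialisation: an assumption, not a requirement
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active_per_row = (codes != 0).sum(axis=1)   # at most k features fire per input
mse = float(np.mean((recon - x) ** 2))      # reconstruction error to minimise in training
```

In training, `mse` would be minimised over the encoder and decoder parameters; each row of `W_dec` then plays the role of one entry in the sparse feature dictionary.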