method
active
method:topk-sparse-autoencoders-saes
TopK Sparse Autoencoders (SAEs)
Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
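As a rough illustration of what a TopK SAE computes, the sketch below runs one forward pass on a single embedding vector: encode into an overcomplete dictionary, keep only the k largest pre-activations, and decode. It is a minimal pure-Python sketch under stated assumptions, not the implementation used in the source work. The function name `topk_sae_forward`, the weight layout, the ReLU on surviving latents, and the tied random initialization are all illustrative choices, and the toy dimensions bear no relation to real EEG transformer embeddings.

```python
import random

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a TopK sparse autoencoder on a single embedding.

    x: list of d floats; W_enc: n rows of d floats; b_enc: n floats;
    W_dec: d rows of n floats; b_dec: d floats.
    Returns (sparse latent code of length n, reconstruction of length d).
    """
    # Center the input on the decoder bias, then project into the dictionary.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [sum(w * c for w, c in zip(row, centered)) + b
           for row, b in zip(W_enc, b_enc)]
    # Keep only the k largest pre-activations; every other latent is zeroed.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    z = [0.0] * len(pre)
    for i in keep:
        z[i] = max(pre[i], 0.0)  # ReLU on the survivors (assumed variant)
    # Reconstruct as a sparse linear combination of dictionary directions.
    x_hat = [sum(w * zi for w, zi in zip(row, z)) + b
             for row, b in zip(W_dec, b_dec)]
    return z, x_hat

# Toy usage with hypothetical sizes: d=4 input dims, n=16 latents, k=3 active.
rng = random.Random(0)
d, n, k = 4, 16, 3
W_enc = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
W_dec = [[W_enc[j][i] for j in range(n)] for i in range(d)]  # tied init: W_dec = W_enc^T
b_enc = [0.0] * n
b_dec = [0.0] * d
x = [rng.gauss(0, 1) for _ in range(d)]
z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active = sum(1 for v in z if v != 0.0)
print(f"{active} of {n} latents active")
```

The hard top-k selection fixes the number of active latents per input directly, rather than tuning an L1 penalty to reach a target sparsity. Training (minimizing reconstruction error over many embeddings) is omitted here.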
Neighborhood — ranked by edge-count
Methods (1)
method
- Sparse Autoencoders (SAE) · related_to · Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Events (1)
event
- Preprint applying TopK SAEs to three EEG transformers to reveal sparse feature dictionaries, steering regimes, and spectral interpretation.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
- Standard interpretability approach that VPD critiques and proposes an alternative to.
- Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
- Foundational empirical result enabling all downstream analysis
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.793 · Core methodology paper for SAE-based interpretable feature extraction
- Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Critique of activation-based interpretability methods.
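
The "cosine ≥ 0.65 · no typed edge" filter that produces a list like the one above can be sketched as follows. This is a hedged illustration only: the page does not say how entity embeddings are computed or stored, so the function names `cosine` and `similar_untyped`, the dictionary of candidate vectors, and the set of already-linked ids are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_untyped(target, candidates, typed_ids, threshold=0.65):
    """Return (entity_id, score) pairs at or above threshold that lack a typed edge.

    target: embedding of the current entity (hypothetical).
    candidates: dict mapping entity id -> embedding (hypothetical).
    typed_ids: ids already connected to the current entity by a typed relation.
    """
    hits = []
    for entity_id, vec in candidates.items():
        if entity_id in typed_ids:
            continue  # already linked by a typed edge, so not a new-edge candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entries that pass this filter are only candidates: a high score may indicate a missing relation or an unrecognized duplicate, and deciding which still requires inspection.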