TopK Sparse Autoencoders

The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
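
As a rough illustration of what a TopK sparse autoencoder computes, the sketch below runs one forward pass in NumPy: encode an embedding, keep only the k largest feature activations per sample, and reconstruct. Everything here is an assumption made for illustration and is not taken from the source: the function name `topk_sae_forward`, the dimensions, the random weights, the subtraction of the decoder bias before encoding, and the ReLU applied to the kept values.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a TopK sparse autoencoder (illustrative sketch)."""
    # Pre-activations for every dictionary feature: shape (batch, n_features)
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    # Keep only those k values (clamped at zero); all other features stay 0
    latents = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    latents[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    # Reconstruct the input from the sparse code
    recon = latents @ W_dec + b_dec
    return latents, recon

# Hypothetical sizes and random weights, standing in for real embeddings
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = W_enc.T.copy()          # tied initialisation, an assumption
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
mse = float(np.mean((x - recon) ** 2))  # reconstruction error to minimise in training
```

Because at most k entries of each row of `latents` are nonzero, sparsity is enforced directly by the selection step rather than by a penalty term in the loss; the rows of `W_dec` play the role of the sparse feature dictionary.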

Neighborhood — ranked by edge-count

paper

concept

Entanglement
implements
Less hierarchical than embedment; multiple texts work into and out of each other, creating associations across levels and connecting any single text to the matrix of all others.
monosemanticity
implements
Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
EEG Transformer Embeddings
implements
The internal representations of EEG transformers from which SAE features are extracted
Sparse Feature Dictionary
introduces
The extracted set of sparse interpretable features from model embeddings via SAEs

framework

Sparse Autoencoder
related_to
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

TopK Sparse Autoencoders (SAEs) · method · 0.878
Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
Sparse Autoencoder Features · concept · 0.852
Used in Anthropic's welfare assessment to identify features that co-activate for performative behavior and hidden emotional struggle
Sparse Autoencoders (SAE) · method · 0.817
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Sparse Autoencoder for Dictionary Learning · framework · 0.811
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
Autoencoder · concept · 0.809
Neural network architecture that learns compressed representations; SOHMs are functionally equivalent.
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.802
Core methodology paper for SAE-based interpretable feature extraction
Sparse autoencoders produce interpretable features for large models. · claim · 0.799
Central claim of the paper: the method scales to state-of-the-art transformers.
Deep Autoencoder · framework · 0.793