TopK Sparse Autoencoders (SAEs)

Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
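As a rough illustration of the method described above, here is a minimal NumPy sketch of a TopK SAE forward pass: the encoder computes pre-activations, only the k largest per input are kept, and the decoder reconstructs the input from that sparse code. The function name, weight shapes, bias handling, and the random stand-in data are all assumptions for illustration, not the setup used in the cited preprint.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, then decode.

    Hypothetical helper: shapes and the bias convention are assumptions.
    """
    pre = (x - b_dec) @ W_enc + b_enc            # (batch, n_features)
    # Indices of the k largest pre-activations in each row (unordered).
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    # Copy the selected values; the ReLU zeroes any negative survivors.
    z[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    x_hat = z @ W_dec + b_dec                    # (batch, d_model)
    return z, x_hat

# Random stand-in for transformer embeddings (not real EEG data).
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = W_enc.T.copy()                           # tied initialization, an assumption
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of `z` has at most `k` nonzero entries, which is the sparsity constraint that gives the method its name. In practice the weights would be trained to minimize reconstruction error between `x_hat` and `x`.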

Neighborhood — ranked by edge-count

method

Sparse Autoencoders (SAE)
related_to
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.

event

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2026)
mentions
Preprint applying TopK SAEs to three EEG transformers to reveal sparse feature dictionaries, steering regimes, and spectral interpretation.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

TopK Sparse Autoencoders · framework · 0.878
The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries
Sparse Autoencoders (SAE) activation-based paradigm · framework · 0.825
Standard interpretability approach that VPD critiques and proposes an alternative to.
Sparse Autoencoder · framework · 0.811
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers. · finding · 0.799
Foundational empirical result enabling all downstream analysis
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.793
Core methodology paper for SAE-based interpretable feature extraction
Sparse Autoencoder Features · concept · 0.789
Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
Sparse autoencoders produce interpretable features for large models. · claim · 0.784
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.777
Critique of activation-based interpretability methods.