finding

active

SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers.

Foundational empirical result enabling all downstream analysis

Source paper

extracted_from

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Claims (1)

claim

A single SAE hyperparameter procedure driven by an intrinsic dictionary health audit transfers robustly across all three EEG transformer architectures.
supports
Key methodological contribution claim about architecture-agnostic SAE tuning

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Are the features extracted by SAEs from EEG transformers monosemantic or entangled? · question · 0.832
Research question motivating the monosemanticity and entanglement benchmarking
Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM. · finding · 0.826
Quantitative assessment of feature quality using clinical concepts across models.
Sparse Autoencoders (SAE) · method · 0.810
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.799
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
TopK Sparse Autoencoders (SAEs) · method · 0.799
Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
Sparse Autoencoders (SAE) activation-based paradigm · framework · 0.790
Standard interpretability approach that VPD critiques and proposes an alternative to.
Sparse autoencoders produce interpretable features for large models. · claim · 0.786
Central claim of the paper: the method scales to state-of-the-art transformers.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.786
Claim that feature grounding enables interpretability metrics.
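
Illustration (not from the source paper)

The finding above and the related TopK SAE method entry both describe turning transformer embeddings into a sparse feature dictionary. As a rough illustration of what that means, the following is a minimal sketch of a TopK sparse autoencoder forward pass in NumPy. It is a generic formulation, not the paper's implementation: the function name `topk_sae_forward`, the weight names, the dimensions, and the value of `k` are all hypothetical, random weights stand in for a trained dictionary, and random vectors stand in for SleepFM, REVE, or LaBraM embeddings.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Generic TopK SAE forward pass (illustrative, not the paper's code).

    x: (n, d) batch of embeddings.
    Returns (codes, recon): codes is (n, m) with at most k nonzeros per row,
    recon is the (n, d) reconstruction from the decoder dictionary.
    """
    # Encoder pre-activations, centred on the decoder bias (a common convention).
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    # Keep only those k entries (passed through ReLU); everything else is zero.
    codes = np.zeros_like(pre)
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    # Each row of W_dec is one dictionary feature direction in embedding space.
    recon = codes @ W_dec + b_dec
    return codes, recon


# Hypothetical sizes: 16-dim embeddings, 64 dictionary features, 4 active per input.
rng = np.random.default_rng(0)
d, m, k = 16, 64, 4
x = rng.normal(size=(8, d))                   # stand-in for model embeddings
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)  # untrained, random weights
W_dec = W_enc.T.copy()
b_enc = np.zeros(m)
b_dec = np.zeros(d)

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active_per_input = np.count_nonzero(codes, axis=1)
```

In this sketch, sparsity is enforced structurally: each input activates at most `k` dictionary features, so `active_per_input` never exceeds `k`. In a trained SAE the weights would be fit to minimise reconstruction error, which this example does not attempt.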