question

active

question:are-the-features-extracted-by-saes-from-eeg-transformers-monosemantic-or-entangled

Are the features extracted by SAEs from EEG transformers monosemantic or entangled?

Research question motivating the monosemanticity and entanglement benchmarking

Source paper

extracted_from

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Claims (1)

claim

Age and pathology are clinically entangled in EEG foundation model representations such that suppressing one concept inevitably corrupts the other.
gates
A specific representational failure with direct clinical safety implications

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers. · finding · 0.832
Foundational empirical result enabling all downstream analysis
Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM. · finding · 0.804
Quantitative assessment of feature quality using clinical concepts across models.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.793
Claim that feature grounding enables interpretability metrics.
A single SAE hyperparameter procedure driven by an intrinsic dictionary health audit transfers robustly across all three EEG transformer architectures. · claim · 0.777
Key methodological contribution claim about architecture-agnostic SAE tuning
Our SAEs' features are more interpretable than neurons. · claim · 0.776
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence. · claim · 0.767
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
We hypothesize that applying SAE-based mechanistic interpretability to EEG foundation models can expose representational failures and thereby improve clinical trust. · hypothesis · 0.752
Overarching motivating hypothesis of the paper
EEG Transformer Embeddings · concept · 0.743
The internal representations of EEG transformers from which SAE features are extracted
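
The "Related by similarity" rule stated above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. How this page actually computes its embeddings and scores is not stated, so the function name `related_by_similarity`, the `embeddings` mapping, and the `typed_edges` pair list are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(query_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that clear the threshold and have
    no typed edge to the query entity, sorted by descending score.

    embeddings:  dict of entity id -> vector (hypothetical storage layout)
    typed_edges: iterable of (source_id, target_id) pairs (hypothetical)
    """
    query_vec = embeddings[query_id]
    # Collect entities already connected to the query in either direction.
    linked = set()
    for src, dst in typed_edges:
        if src == query_id:
            linked.add(dst)
        elif dst == query_id:
            linked.add(src)

    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == query_id or entity_id in linked:
            continue
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity that already has a typed edge to the query (such as the `gates` claim listed under Claims) would be excluded from this list even if its similarity score is high.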