finding

active

finding:monosemanticity-and-entanglement-of-sae-features-were-benchmarked-for-clinical-taxonomy-grounding-across-sleepfm-reve-labram

Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM.

Quantitative assessment of SAE feature quality against clinical concepts, compared across the three models.

Source paper

extracted_from

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Claims (1)

claim

SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement.
supports
Claim that feature grounding enables interpretability metrics.
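
As a rough illustration of what grounding features in a clinical taxonomy could look like in practice, the sketch below scores each SAE feature against a set of concept labels. It is a minimal sketch under stated assumptions, not the paper's method: the source does not specify the metrics, so the association measure (absolute Pearson correlation), the two scores, the `threshold` value, and all function names are hypothetical choices made for this example, and the data is a toy full-factorial design.

```python
from itertools import product

import numpy as np


def concept_association(acts, labels):
    """Absolute Pearson correlation between every feature and every concept.

    acts:   (n_samples, n_features) SAE feature activations
    labels: (n_samples, n_concepts) numeric concept labels (e.g. 0/1)
    Returns an (n_features, n_concepts) array.
    """
    a = acts - acts.mean(axis=0)
    l = labels - labels.mean(axis=0)
    cov = a.T @ l
    denom = np.outer(np.linalg.norm(a, axis=0), np.linalg.norm(l, axis=0))
    # Constant columns give a zero denominator; treat those as zero association.
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
    return np.abs(corr)


def monosemanticity(assoc):
    """Hypothetical score: share of a feature's total association held by its top concept."""
    total = assoc.sum(axis=1)
    top = assoc.max(axis=1)
    return np.divide(top, total, out=np.zeros_like(top), where=total > 0)


def entanglement(assoc, threshold=0.3):
    """Hypothetical score: number of concepts beyond the first with association >= threshold."""
    return np.maximum((assoc >= threshold).sum(axis=1) - 1, 0)


# Toy data (assumption): the four concept names come from the claim above,
# the labels are all 16 combinations of four binary concepts.
concepts = ["abnormality", "age", "sex", "medication"]
labels = np.array(list(product([0.0, 1.0], repeat=4)))

# Feature 0 tracks one concept; feature 1 mixes two concepts.
acts = np.column_stack([labels[:, 0], labels[:, 0] + labels[:, 1]])

assoc = concept_association(acts, labels)
mono = monosemanticity(assoc)
ent = entanglement(assoc)
print(dict(zip(["feature_0", "feature_1"], zip(mono.round(3), ent))))
```

In this toy setup the single-concept feature gets a monosemanticity score of 1.0 and an entanglement count of 0, while the mixed feature splits its association evenly across two concepts (score 0.5, count 1). Real clinical labels such as age are continuous and correlated with each other, so a practical benchmark would need to account for that.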

Communities (3)

community

Manifold-aware concept steering in neural representations
members_of
Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
Concept entanglement in biomedical foundation models
members_of
Investigates the inseparability of clinical concepts (age, pathology) in EEG transformers, using SAE feature analysis and steering metrics across the SleepFM, REVE, and LaBraM architectures.
SAE Feature Geometry in Biomedical Signals
members_of
Evaluating sparse autoencoder monosemanticity and entanglement using clinical taxonomy grounding across EEG/sleep foundation models.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers. · finding · 0.826
Foundational empirical result enabling all downstream analysis.
Are the features extracted by SAEs from EEG transformers monosemantic or entangled? · question · 0.804
Research question motivating the monosemanticity and entanglement benchmarking.
Monosemantic Functional Features · concept · 0.792
Features that correspond to a single semantic concept and are effective for steering behavior.
monosemanticity · concept · 0.790
Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
82% of features in 1M SAE had maximum Pearson correlation ≤ 0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. · finding · 0.783
SAE features are not simply mirroring individual neurons.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation. · claim · 0.774
Extension of mechanistic interpretability findings to the metacognitive domain.
Concept steering with target vs off-target probe area metric reveals three operational regimes (selectively steerable, encoded but entangled, non-encoded) across SleepFM, REVE, LaBraM. · finding · 0.772
Result categorizing concept steerability into three distinct regimes.
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence. · claim · 0.771
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights.
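
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over entity embeddings. This is only an illustrative sketch: the source does not describe how the graph computes or stores embeddings, so the function name, the dictionary layout, the entity IDs, and the 2-D toy vectors are all assumptions.

```python
import numpy as np


def similarity_candidates(target, embeddings, typed_neighbors, threshold=0.65):
    """Return (entity_id, cosine) pairs at or above `threshold`, highest first.

    Entities that already share a typed edge with `target` are skipped,
    as is `target` itself. Assumes all embedding vectors are non-zero.
    """
    t = np.asarray(embeddings[target], dtype=float)
    t = t / np.linalg.norm(t)
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target or entity_id in typed_neighbors:
            continue
        v = np.asarray(vec, dtype=float)
        cosine = float(np.dot(t, v) / np.linalg.norm(v))
        if cosine >= threshold:
            candidates.append((entity_id, cosine))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Hypothetical entity IDs and toy embeddings, for illustration only.
embeddings = {
    "finding_a": [1.0, 0.0],
    "claim_b": [1.0, 0.1],     # very similar, but already linked by a typed edge
    "concept_c": [1.0, 0.5],
    "finding_d": [0.0, 1.0],   # orthogonal, falls below the threshold
    "question_e": [1.0, 1.0],
}
typed_neighbors = {"claim_b"}

result = similarity_candidates("finding_a", embeddings, typed_neighbors)
for entity_id, cosine in result:
    print(f"{entity_id} · {cosine:.3f}")
```

Excluding typed neighbors before thresholding is what makes the output useful for curation: every entity that survives is either a candidate for a new edge or a possible duplicate.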