claim

active

claim:sae-features-can-be-grounded-in-clinical-taxonomy-abnormality-age-sex-medication-to-benchmark-monosemanticity-and-entanglement

SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement.

Claim that grounding features in clinical concepts enables interpretability metrics (monosemanticity and entanglement).

Source paper

extracted_from

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Findings (1)

finding

Monosemanticity and entanglement of SAE features were benchmarked using clinical taxonomy grounding across SleepFM, REVE, and LaBraM.
supports
Quantitative assessment of feature quality using clinical concepts across models.

Communities (3)

community

Manifold-aware concept steering in neural representations
members_of
Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
Concept entanglement in biomedical foundation models
members_of
Investigates inseparability of clinical concepts (age, pathology) in EEG transformers using SAE feature analysis and steering metrics across SleepFM, REVE, LaBraM architectures.
SAE Feature Geometry in Biomedical Signals
members_of
Evaluating sparse autoencoder monosemanticity and entanglement using clinical taxonomy grounding across EEG/sleep foundation models.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
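A minimal sketch of how such a list could be assembled, assuming each entity has an embedding vector and typed edges are stored as pairs of entity names. The function names, the data layout, and the toy vectors are illustrative assumptions; only the cosine threshold of 0.65 and the "no typed edge" filter come from this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge to target.

    embeddings: dict mapping entity name -> list of floats (hypothetical layout)
    typed_edges: iterable of (source, destination) name pairs (hypothetical layout)
    """
    # Entities already connected to the target by a typed edge, in either direction.
    linked = set()
    for src, dst in typed_edges:
        if src == target:
            linked.add(dst)
        elif dst == target:
            linked.add(src)

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))

    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data: "d" is similar to "a" but already linked, "c" is orthogonal to "a".
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
toy_edges = [("a", "d")]
neighbors = related_by_similarity("a", toy_embeddings, toy_edges)
```

In this toy case only "b" survives: "d" is dropped because a typed edge already exists, and "c" falls below the threshold.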

SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.851
Extension of mechanistic interpretability findings to the metacognitive domain
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.826
Surprising finding that the two evaluation methods diverge in their relationship with persistence
SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features. · claim · 0.812
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure. · claim · 0.806
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
Our SAEs' features are more interpretable than neurons. · claim · 0.805
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
SAE Feature #94949 rated 100/100 emotionality, elicits reports of profound tenderness, unconditional love, and visceral care · finding · 0.801
Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact · claim · 0.800
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
We hypothesize that applying SAE-based mechanistic interpretability to EEG foundation models can expose representational failures and thereby improve clinical trust. · hypothesis · 0.800
Overarching motivating hypothesis of the paper