paper:doi-10-48550-arxiv-2605-13930
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
TL;DR
Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models (SleepFM, REVE, and LaBraM) reveals that clinical concepts are not cleanly separable in these models' latent spaces, with age-pathology confounding emerging as a structural failure mode rather than a tuning artifact. A single hyperparameter procedure guided by an intrinsic dictionary health audit transfers robustly across all three architectures without per-model recalibration. The paper introduces a 'target vs. off-target' probe area metric for concept steering, which operationalizes steering selectivity and exposes three distinct regimes: selectively steerable, encoded but entangled, and non-encoded. Critically, some interventions act as 'wrecking-ball' manipulations that collapse global model performance, so for the affected concepts targeted suppression cannot be achieved without corrupting the broader representation. A spectral decoder then maps latent interventions back to physiologically interpretable frequency signatures, including pathological slow-wave suppression and α-band restoration, grounding abstract latent operations in clinically recognizable EEG phenomena. Benchmarked against a clinical taxonomy spanning abnormality, age, sex, and medication, the framework quantifies monosemanticity and entanglement across architectures. The implication drawn is that current EEG foundation models carry embedded clinical confounds that are mechanistically inseparable, which is a direct barrier to safe deployment in diagnostic settings unless architectural changes enforce disentanglement.
What to take away
- 1. TopK Sparse Autoencoders applied to the embeddings of SleepFM, REVE, and LaBraM extract sparse feature dictionaries that can be benchmarked for monosemanticity and entanglement using a four-concept clinical taxonomy: abnormality, age, sex, and medication.
- 2. A single SAE hyperparameter selection procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architecturally distinct EEG transformer models without requiring per-architecture re-tuning.
- 3. Concept steering experiments reveal three operational regimes — selectively steerable, encoded but entangled, and non-encoded — indicating that not all clinically relevant features can be independently manipulated even when they are detectably represented.
- 4. The paper introduces a 'target vs. off-target' probe area metric as a quantitative measure of steering selectivity, enabling comparison of intervention precision across architectures and concepts.
- 5. Age-pathology confounding is identified as a structural entanglement where suppressing the age concept in the latent space inevitably corrupts the pathology representation, making clean disentanglement impossible within the current model families.
- 6. 'Wrecking-ball' interventions (concept steering operations that collapse global model performance rather than selectively suppressing a target concept) are demonstrated empirically in at least one of the three tested architectures.
- 7. A spectral decoder maps latent concept steering interventions back to the amplitude spectrum, producing physiologically interpretable outputs including pathological slow-wave suppression and α-band restoration.
- 8. To replicate the dictionary health audit methodology, a researcher would train TopK SAEs on frozen embeddings extracted from each model, then apply the intrinsic audit criterion to select the SAE hyperparameter before any downstream probing or steering.
- 9. An open question the paper raises is whether architectural modifications that explicitly enforce disentanglement during EEG foundation model pre-training could eliminate the structural clinical confounds identified here, rather than relying on post-hoc SAE analysis.
- 10. The framework benchmarks monosemanticity and entanglement across SleepFM, REVE, and LaBraM using the same clinical taxonomy, providing a cross-architecture comparison of representational quality that did not previously exist for EEG foundation models.
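The audit-then-select step in takeaway 8 can be sketched as follows. This is a minimal illustration, not the paper's procedure: the summary does not state which intrinsic criteria the audit uses, so the three statistics here (dead-feature fraction, dense-feature fraction, explained variance), the thresholds, and the function names `dictionary_health` and `select_k` are all assumptions. The numbers in `reports` are made up for the example and are not results from the paper.

```python
import numpy as np

def dictionary_health(z, x, x_hat, dense_thresh=0.5):
    """Summarise an SAE dictionary from codes z (n, m) and reconstructions x_hat (n, d).

    Assumed audit statistics (hypothetical, not taken from the paper):
    dead features never fire, dense features fire on more than dense_thresh of samples.
    """
    fire_rate = (z != 0).mean(axis=0)            # per-feature firing frequency
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return {
        "dead_frac": float((fire_rate == 0).mean()),
        "dense_frac": float((fire_rate > dense_thresh).mean()),
        "explained_var": float(1.0 - resid / total),
    }

def select_k(reports, max_dead=0.1, max_dense=0.05):
    """Among candidates passing the health limits, pick the best reconstruction."""
    healthy = {k: r for k, r in reports.items()
               if r["dead_frac"] <= max_dead and r["dense_frac"] <= max_dense}
    if not healthy:
        return None
    return max(healthy, key=lambda k: healthy[k]["explained_var"])

# Toy codes: feature 1 is dead, feature 3 fires on every sample.
z = np.array([[1.0, 0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0, 1.0],
              [3.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.5, 1.0]])
x = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [1.0, 1.0]])
report = dictionary_health(z, x, x_hat=x)        # perfect reconstruction, for illustration only

# Made-up audit reports keyed by candidate k.
reports = {
    8:   {"dead_frac": 0.02, "dense_frac": 0.00, "explained_var": 0.71},
    32:  {"dead_frac": 0.05, "dense_frac": 0.01, "explained_var": 0.84},
    128: {"dead_frac": 0.01, "dense_frac": 0.30, "explained_var": 0.93},
}
chosen = select_k(reports)
```

The point of the design is that selection depends only on properties of the dictionary itself, so no clinical labels are consulted before probing or steering begins.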
Peer brief — for seminar discussion
Working across three EEG transformer foundation models (SleepFM, REVE, and LaBraM), this work applies TopK Sparse Autoencoders (SAEs) to extract interpretable feature dictionaries from frozen model embeddings, then probes those dictionaries against a four-concept clinical taxonomy (abnormality, age, sex, medication) using a newly introduced 'target vs. off-target' probe area metric for concept steering. The SAE hyperparameter is selected via an intrinsic dictionary health audit, a procedure that transfers across all three architectures without per-model adjustment, which is non-trivial given how differently SleepFM, REVE, and LaBraM are structured. The load-bearing finding is that clinical concepts in these models are not cleanly disentangled at the representational level. Steering experiments reveal three regimes: selectively steerable concepts that respond to targeted intervention, encoded but entangled concepts where suppressing one corrupts another, and non-encoded concepts that are not detectably represented in the latent space. Age-pathology confounding is the clearest example of entanglement: within these models, age representation cannot be suppressed without corrupting pathology representation. Additionally, some steering attempts produce 'wrecking-ball' interventions that degrade global model performance rather than achieving selective concept suppression. A spectral decoder closes the loop by translating latent interventions into amplitude-spectrum signatures such as slow-wave suppression and α-band restoration, making the analysis legible in physiological terms. The implication is clinical and architectural: EEG foundation models as currently trained carry embedded confounds that are mechanistically inseparable, which poses a direct challenge to their trustworthy deployment in diagnostic pipelines.
The paper implicitly predicts that pre-training objectives enforcing disentanglement would be required to escape these failure modes, rather than post-hoc interpretability patching. The most contestable aspect is the causal interpretation of steering results. The probe area metric quantifies how selectively a latent direction can be perturbed, but a critical reader would push back on whether concept steering via SAE feature manipulation actually licenses conclusions about the model's internal causal structure, as opposed to reflecting the geometry of the probe's linear readout — a distinction the analysis does not fully resolve. An alternative methodology would have been activation patching or causal tracing (as used in LLM mechanistic interpretability), which would provide a more direct handle on causal mediation but is harder to apply at the scale of continuous EEG embeddings. The scope is also limited to three models from one data modality, leaving open how broadly the three-regime taxonomy generalizes to other biosignal or clinical foundation models.
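For readers who want a concrete picture of what "steering via SAE feature manipulation" means in the discussion above, here is a minimal sketch. It assumes steering is done by rescaling one feature's decoded contribution and adding the difference back to the original embedding; the summary does not say whether the paper steers this way, and `steer_feature` and its arguments are hypothetical names. All arrays are random stand-ins, not EEG embeddings.

```python
import numpy as np

def steer_feature(x, z, W_dec, feature, scale):
    """Rescale one SAE feature's contribution to the embedding.

    x: (n, d) embeddings; z: (n, m) sparse codes; W_dec: (m, d) decoder rows.
    scale=0 removes the feature's decoded contribution; scale=1 leaves x unchanged.
    Only the change is added to x, so the SAE's reconstruction error is untouched.
    """
    delta = (scale - 1.0) * z[:, [feature]] * W_dec[feature]   # broadcasts to (n, d)
    return x + delta

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))               # stand-in for frozen model embeddings
z = np.abs(rng.normal(size=(5, 12)))      # stand-in for SAE codes
W_dec = rng.normal(size=(12, 8))          # stand-in for decoder directions

x_same = steer_feature(x, z, W_dec, feature=3, scale=1.0)
x_ablate = steer_feature(x, z, W_dec, feature=3, scale=0.0)
```

Under this formulation the intervention moves every sample along a single fixed decoder direction, which is one reason the causal-versus-geometric objection raised above is hard to settle from steering curves alone.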
Methods (3)
- Intrinsic Dictionary Health Audit: A hyperparameter selection procedure driven by intrinsic measures of SAE dictionary quality that transfers across architectures.
- Spectral Decoder: Method that maps latent concept steering interventions back to the EEG amplitude spectrum to obtain physiologically interpretable frequency signatures.
- Target vs. Off-Target Probe Area Metric: Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
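One way to read the probe area metric is as an area under a probe-score curve swept over steering strength, computed separately for the steered concept and for the other concepts. The sketch below follows that reading. The exact definition is not given in this summary, so the use of absolute deviation from the unsteered baseline, the trapezoidal rule, averaging over off-target concepts, and the names `curve_area` and `target_vs_off_target` are all assumptions. The probe scores are invented for illustration.

```python
def curve_area(alphas, scores):
    """Trapezoidal area of |score - baseline| over steering strength.

    The baseline is the first score, assumed to be the unsteered probe score.
    """
    base = scores[0]
    dev = [abs(s - base) for s in scores]
    return sum(0.5 * (dev[i] + dev[i + 1]) * (alphas[i + 1] - alphas[i])
               for i in range(len(alphas) - 1))

def target_vs_off_target(alphas, target_scores, off_target_scores):
    """Return (target_area, mean_off_target_area) for one steering sweep."""
    t = curve_area(alphas, target_scores)
    offs = [curve_area(alphas, s) for s in off_target_scores.values()]
    return t, sum(offs) / len(offs)

# Made-up probe accuracies at three steering strengths.
alphas = [0.0, 1.0, 2.0]
target = [0.9, 0.7, 0.5]                       # steered concept responds
off = {"sex": [0.8, 0.8, 0.8],                 # unaffected
       "medication": [0.7, 0.7, 0.6]}          # slightly disturbed
t_area, off_area = target_vs_off_target(alphas, target, off)
```

Read against the three regimes: a large target area with a small off-target area would look selectively steerable, both large would look entangled, and both near zero would look non-encoded.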
Frameworks (4)
- LaBraM: EEG transformer foundation model for brain activity analysis, one of the three architectures studied.
- REVE: EEG transformer foundation model (representation model) analyzed in the study.
- SleepFM: EEG transformer foundation model for sleep staging, one of the three analyzed architectures.
- TopK Sparse Autoencoders: The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries.
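As a reminder of what a TopK SAE computes, the following is a minimal forward-pass sketch in NumPy. It shows the generic technique (keep the k largest pre-activations per sample, zero the rest, decode linearly) and is not the paper's implementation. Subtracting the decoder bias before encoding, the ReLU on kept units, the tied initialisation, and all shapes are assumptions chosen for illustration.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, decode.

    x: (n, d) frozen embeddings; W_enc: (d, m); W_dec: (m, d).
    Returns (z, x_hat) where z has at most k non-zeros per row.
    """
    pre = (x - b_dec) @ W_enc + b_enc                  # (n, m) pre-activations
    # Column indices of the k largest entries in each row (unordered within the k).
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)     # ReLU on the kept units
    x_hat = z @ W_dec + b_dec
    return z, x_hat

rng = np.random.default_rng(0)
n, d, m, k = 8, 16, 64, 4                              # toy sizes, not the paper's
x = rng.normal(size=(n, d))                            # stand-in for model embeddings
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = W_enc.T.copy()                                 # tied initialisation (assumption)
b_enc = np.zeros(m)
b_dec = np.zeros(d)
z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Here k directly fixes the per-sample sparsity, which is why a single hyperparameter can be the object of the selection procedure.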
Findings (12)
- Concept steering with target vs off-target probe area metric reveals three operational regimes (selectively steerable, encoded but entangled, non-encoded) across SleepFM, REVE, LaBraM.
Result categorizing concept steerability into three distinct regimes.
- Age-pathology confounding is empirically demonstrated: suppressing age representation corrupts pathology representation in EEG foundation models.
Specific instance of clinical entanglement with patient safety implications
- Spectral decoder reveals pathological slow-wave suppression as a frequency signature of concept steering interventions in EEG foundation models.
Links latent space manipulation to known EEG neurophysiology
- A single hyperparameter procedure driven by the intrinsic dictionary health audit transfers robustly across SleepFM, REVE, and LaBraM.
Demonstrates architecture-agnostic applicability of the SAE tuning method
- Spectral decoder reveals α-band restoration as a frequency signature of concept steering interventions in EEG foundation models.
Links latent space manipulation to known EEG neurophysiology
- Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM.
Quantitative assessment of feature quality using clinical concepts across models.
- Concept steering experiments identify three distinct operational regimes across clinical concepts in EEG foundation models.
Main empirical finding of the concept steering analysis
- Wrecking-ball interventions that collapse global model performance are empirically identified in EEG foundation models.
Demonstrates a critical failure mode of concept steering with clinical safety implications
- SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers.
Foundational empirical result enabling all downstream analysis
- Concept interventions on some concepts act as 'wrecking-ball' interventions, collapsing global model performance.
Observation of catastrophic performance drop when steering certain concepts.
Claims (10)
- The spectral decoder successfully translates latent SAE interventions into physiologically interpretable frequency signatures such as slow-wave suppression and α-band restoration.
Key result linking abstract latent manipulations to known EEG neurophysiology
- EEG foundation models achieve state-of-the-art clinical performance yet their internal computations remain opaque, constituting a barrier to clinical trust.
Motivating claim for the entire paper
- Some SAE concept steering interventions act as 'wrecking balls' that collapse global model performance rather than selectively modifying target concepts.
A critical failure mode identified in the paper demonstrating risk of naïve concept steering
- A single SAE hyperparameter procedure driven by an intrinsic dictionary health audit transfers robustly across all three EEG transformer architectures.
Key methodological contribution claim about architecture-agnostic SAE tuning
- Age and pathology are clinically entangled in EEG foundation model representations such that suppressing one concept inevitably corrupts the other.
A specific representational failure with direct clinical safety implications
- The target vs. off-target probe area metric quantifies steering selectivity and distinguishes selectively steerable from entangled interventions.
Justification for the novel metric introduced in the paper
- Clinical concepts in EEG foundation models fall into three operational regimes: selectively steerable, encoded but entangled, and non-encoded.
Interpretive claim summarizing the spectrum of concept steerability discovered.
- SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement.
Claim that feature grounding enables interpretability metrics.
- Spectral decoding of concept interventions can provide physiologically interpretable frequency signatures.
Claim that the spectral decoder adds physiological interpretability.
- Age-pathology confounding prevents independent steering of age and pathology concepts.
Interpretive assertion about clinical entanglement in the representations.
Hypotheses (1)
- We hypothesize that applying SAE-based mechanistic interpretability to EEG foundation models can expose representational failures and thereby improve clinical trust.
Overarching motivating hypothesis of the paper
Questions (6)
- What physiologically interpretable frequency signatures correspond to latent concept steering manipulations in EEG foundation models?
Research question motivating the spectral decoder methodology
- Can concept steering interventions on EEG foundation models be made selective rather than globally destructive?
Research question motivating the introduction of the probe area metric and identification of operational regimes
- What clinical concepts are encoded in the internal representations of EEG foundation models?
Primary research question driving the extraction and benchmarking of SAE features
- Can clinical concepts be selectively steered without damaging unrelated performance?
Question about the feasibility of safe concept steering in EEG models.
- Are the features extracted by SAEs from EEG transformers monosemantic or entangled?
Research question motivating the monosemanticity and entanglement benchmarking
- How are clinical concepts represented and steerable in EEG foundation models?
Core research question driving the mechanistic investigation.
Original abstract
EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers: SleepFM, REVE, and LaBraM to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $α$-band restoration.