Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

By William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen · Anton Storgaard Mosquera · Nick Williams +7 more

DOI 10.48550/arxiv.2605.13930 · arXiv 2605.13930 · OpenAlex W7161273317

Mechanistic Interpretability · LaBraM · Intrinsic Dictionary Health Audit · Three Operational Regimes of Steering · REVE · Spectral Decoder · wrecking-ball intervention · SleepFM · Target vs. Off-Target Probe Area Metric · TopK Sparse Autoencoders

TL;DR

Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models (SleepFM, REVE, and LaBraM) shows that clinical concepts are not always cleanly separable in these models' latent spaces. Age-pathology confounding is the main example: the two concepts are entangled in the representation itself, so suppressing one corrupts the other. A single hyperparameter procedure guided by an intrinsic dictionary health audit transfers robustly across all three architectures without per-model recalibration. The paper introduces a 'target vs. off-target' probe area metric for concept steering, which quantifies steering selectivity and exposes three regimes: selectively steerable, encoded but entangled, and non-encoded. Some interventions act as 'wrecking-ball' manipulations that collapse global model performance, so for those concepts targeted suppression is not possible without damaging the broader representation. A spectral decoder then maps latent interventions back to the amplitude spectrum, yielding physiologically interpretable frequency signatures such as pathological slow-wave suppression and α-band restoration. Benchmarked against a clinical taxonomy spanning abnormality, age, sex, and medication, the framework quantifies monosemanticity and entanglement across architectures. This suggests that current EEG foundation models carry embedded clinical confounds that post-hoc steering cannot always separate, a barrier to safe deployment in diagnostic settings that may require architectural changes enforcing disentanglement.

What to take away

1. TopK Sparse Autoencoders applied to the embeddings of SleepFM, REVE, and LaBraM extract sparse feature dictionaries that can be benchmarked for monosemanticity and entanglement using a four-concept clinical taxonomy: abnormality, age, sex, and medication.
2. A single SAE hyperparameter selection procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architecturally distinct EEG transformer models without requiring per-architecture re-tuning.
3. Concept steering experiments reveal three operational regimes — selectively steerable, encoded but entangled, and non-encoded — indicating that not all clinically relevant features can be independently manipulated even when they are detectably represented.
4. The paper introduces a 'target vs. off-target' probe area metric as a quantitative measure of steering selectivity, enabling comparison of intervention precision across architectures and concepts.
5. Age-pathology confounding is identified as a structural entanglement where suppressing the age concept in the latent space inevitably corrupts the pathology representation, making clean disentanglement impossible within the current model families.
6. 'Wrecking-ball' interventions, meaning concept steering operations that collapse global model performance instead of selectively suppressing a target concept, are demonstrated empirically in at least one of the three tested architectures.
7. A spectral decoder maps latent concept steering interventions back to the amplitude spectrum, producing physiologically interpretable outputs including pathological slow-wave suppression and α-band restoration.
8. To replicate the dictionary health audit methodology, a researcher would train TopK SAEs on frozen embeddings extracted from each model, then apply the intrinsic audit criterion to select the SAE hyperparameter before any downstream probing or steering.
9. An open question the paper raises is whether architectural modifications that explicitly enforce disentanglement during EEG foundation model pre-training could eliminate the structural clinical confounds identified here, rather than relying on post-hoc SAE analysis.
10. The framework benchmarks monosemanticity and entanglement across SleepFM, REVE, and LaBraM using the same clinical taxonomy, providing a cross-architecture comparison of representational quality that did not previously exist for EEG foundation models.
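
The replication recipe in takeaway 8 can be sketched in Python. The summary does not say which statistics the intrinsic audit uses, so the dead-feature fraction, dense-feature fraction, and explained variance below are illustrative assumptions, as are the function names `dictionary_health` and `select_setting` and both thresholds. The sketch only illustrates that such an audit needs no clinical labels, so it can run before any probing or steering.

```python
import numpy as np

def dictionary_health(x, x_hat, codes):
    """Label-free statistics for one trained SAE (illustrative choices, not the paper's).

    x: (n, d) frozen embeddings, x_hat: (n, d) SAE reconstructions,
    codes: (n, m) sparse SAE activations.
    """
    fired = codes != 0
    dead_fraction = float(np.mean(~fired.any(axis=0)))          # features that never fire
    dense_fraction = float(np.mean(fired.mean(axis=0) > 0.5))   # features firing on >50% of inputs
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return {
        "dead_fraction": dead_fraction,
        "dense_fraction": dense_fraction,
        "explained_variance": float(1.0 - resid / total),
    }

def select_setting(audits, max_dead=0.1, max_dense=0.05):
    """Pick the healthy setting with the best reconstruction.

    audits maps a candidate setting (e.g. a value of k) to its audit dict.
    The thresholds are hypothetical defaults.
    """
    healthy = {
        s: a for s, a in audits.items()
        if a["dead_fraction"] <= max_dead and a["dense_fraction"] <= max_dense
    }
    pool = healthy or audits  # fall back to all candidates if none is healthy
    return max(pool, key=lambda s: pool[s]["explained_variance"])
```

Given one audit per candidate setting, `select_setting` returns the healthy candidate with the highest explained variance, and falls back to the full candidate set if none passes the thresholds.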

Peer brief — for seminar discussion

Working across three EEG transformer foundation models (SleepFM, REVE, and LaBraM), this work applies TopK Sparse Autoencoders (SAEs) to extract interpretable feature dictionaries from frozen model embeddings, then probes those dictionaries against a four-concept clinical taxonomy (abnormality, age, sex, medication) using a newly introduced 'target vs. off-target' probe area metric for concept steering. The SAE hyperparameter is selected via an intrinsic dictionary health audit, a procedure that transfers across all three architectures without per-model adjustment, which is non-trivial given how differently SleepFM, REVE, and LaBraM are structured. The load-bearing finding is that clinical concepts in these models are not cleanly disentangled at the representational level. Steering experiments reveal three regimes: selectively steerable concepts that respond to targeted intervention, encoded but entangled concepts where suppressing one corrupts another, and non-encoded concepts that are absent from the latent space. Age-pathology confounding is the clearest example of entanglement: within these models, age representation cannot be suppressed without corrupting pathology representation. Some steering attempts also produce 'wrecking-ball' interventions that degrade global model performance instead of selectively suppressing a concept. A spectral decoder closes the loop by translating latent interventions into amplitude-spectrum signatures such as slow-wave suppression and α-band restoration, making the analysis legible in physiological terms. The implication is clinical and architectural: EEG foundation models as currently trained carry embedded confounds that are mechanistically inseparable, which poses a direct challenge to their trustworthy deployment in diagnostic pipelines.
The paper implicitly predicts that pre-training objectives enforcing disentanglement would be required to escape these failure modes, rather than post-hoc interpretability patching. The most contestable aspect is the causal interpretation of steering results. The probe area metric quantifies how selectively a latent direction can be perturbed, but a critical reader would push back on whether concept steering via SAE feature manipulation actually licenses conclusions about the model's internal causal structure, as opposed to reflecting the geometry of the probe's linear readout — a distinction the analysis does not fully resolve. An alternative methodology would have been activation patching or causal tracing (as used in LLM mechanistic interpretability), which would provide a more direct handle on causal mediation but is harder to apply at the scale of continuous EEG embeddings. The scope is also limited to three models from one data modality, leaving open how broadly the three-regime taxonomy generalizes to other biosignal or clinical foundation models.
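
For discussion, it helps to see what a single steering intervention can look like. The sketch below is one common way to steer through an SAE and is an assumption, not the paper's stated procedure: it rescales one feature's activation and writes only the resulting change back into the embedding, which relies on the SAE decoder being linear. The name `steer_embedding` and the toy arrays are hypothetical.

```python
import numpy as np

def steer_embedding(x, codes, W_dec, feature, scale):
    """Rescale one SAE feature and apply the change to the original embedding.

    x: (n, d) embeddings, codes: (n, m) SAE activations, W_dec: (m, d) decoder.
    scale=0 suppresses the feature, scale=1 is a no-op, scale>1 amplifies it.
    Because the decoder is linear, the change in the reconstruction is
    (scale - 1) * activation * decoder_row; adding it to x leaves the part of
    x that the SAE does not reconstruct untouched.
    """
    delta = (scale - 1.0) * codes[:, [feature]]   # (n, 1) change in the code
    return x + delta @ W_dec[[feature], :]        # (n, d) steered embeddings

# Toy example: two embeddings, two features, suppress feature 1.
x = np.zeros((2, 3))
codes = np.array([[1.0, 2.0], [0.0, 3.0]])
W_dec = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
steered = steer_embedding(x, codes, W_dec, feature=1, scale=0.0)
```

Whether the downstream effect of such an edit reflects the model's causal structure or only the geometry of a linear probe is exactly the point of contention raised above.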

Methods (3)

Intrinsic Dictionary Health Audit
A hyperparameter selection procedure driven by intrinsic measures of SAE dictionary quality that transfers across architectures.
Spectral Decoder
Method that maps latent concept steering interventions back to EEG amplitude spectrum to obtain physiologically interpretable frequency signatures.
Target vs. Off-Target Probe Area Metric
Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
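
The exact definition of the probe area metric is not given in this summary. As a hedged illustration, the sketch below assumes that steering strength is swept over a grid, that a probe score (for example accuracy) is recorded for the target concept and for each off-target concept at every strength, and that the "area" is the trapezoidal area of the drop from the unsteered baseline. The function names and the toy scores are hypothetical.

```python
import numpy as np

def drop_area(strengths, scores):
    """Trapezoidal area of the probe-score drop relative to the first (baseline) point."""
    s = np.asarray(strengths, dtype=float)
    y = np.asarray(scores, dtype=float)
    drop = y[0] - y
    return float(np.sum(0.5 * (drop[1:] + drop[:-1]) * np.diff(s)))

def probe_areas(strengths, target_scores, off_target_scores):
    """Return (target area, worst off-target area) for one steering sweep.

    off_target_scores maps concept name -> list of probe scores per strength.
    """
    target = drop_area(strengths, target_scores)
    off = max(drop_area(strengths, v) for v in off_target_scores.values())
    return target, off

# Toy sweep: steering hurts the target probe a lot and one off-target probe a little.
strengths = [0.0, 1.0, 2.0]
target, off = probe_areas(
    strengths,
    target_scores=[0.9, 0.7, 0.5],
    off_target_scores={"sex": [0.8, 0.8, 0.8], "age": [0.8, 0.75, 0.7]},
)
```

Under these assumptions, a large target area with a small off-target area would indicate a selective intervention, while comparable areas would indicate entanglement.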

Frameworks (4)

LaBraM
EEG transformer foundation model for brain activity analysis, one of the three architectures studied.
REVE
EEG transformer foundation model (representation model) analyzed in the study.
SleepFM
EEG transformer foundation model for sleep staging, one of the three analyzed architectures.
TopK Sparse Autoencoders
The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries.
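
For readers unfamiliar with the TopK variant, a minimal forward pass is sketched below. It follows the generic TopK SAE recipe (keep the k largest encoder activations per input, zero the rest, decode linearly); the weights are random and untrained, the sizes are arbitrary, and details such as subtracting the decoder bias before encoding are assumptions that may differ from the paper's implementation.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a TopK sparse autoencoder.

    x: (n, d) frozen embeddings, W_enc: (d, m), W_dec: (m, d).
    Returns sparse codes (n, m) and reconstructions (n, d).
    """
    pre = (x - b_dec) @ W_enc + b_enc                      # encoder pre-activations
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]         # indices of the k largest per row
    kept = np.maximum(np.take_along_axis(pre, idx, axis=1), 0.0)
    codes = np.zeros_like(pre)
    np.put_along_axis(codes, idx, kept, axis=1)            # all other entries stay zero
    x_hat = codes @ W_dec + b_dec                          # linear decoder
    return codes, x_hat

# Toy usage with random, untrained weights (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, m, k = 8, 16, 64, 4
x = rng.normal(size=(n, d))
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
codes, x_hat = topk_sae_forward(x, W_enc, np.zeros(m), W_dec, np.zeros(d), k)
```

Each row of `codes` has at most k nonzero entries, which is what makes the resulting dictionary sparse by construction.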

Findings (12)

Concept steering with target vs off-target probe area metric reveals three operational regimes (selectively steerable, encoded but entangled, non-encoded) across SleepFM, REVE, LaBraM.
Result categorizing concept steerability into three distinct regimes.
Age-pathology confounding is empirically demonstrated: suppressing age representation corrupts pathology representation in EEG foundation models.
Specific instance of clinical entanglement with patient safety implications.
Spectral decoder reveals pathological slow-wave suppression as a frequency signature of concept steering interventions in EEG foundation models.
Links latent space manipulation to known EEG neurophysiology.
A single hyperparameter procedure driven by the intrinsic dictionary health audit transfers robustly across SleepFM, REVE, and LaBraM.
Demonstrates architecture-agnostic applicability of the SAE tuning method.
Spectral decoder reveals α-band restoration as a frequency signature of concept steering interventions in EEG foundation models.
Links latent space manipulation to known EEG neurophysiology.
Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM.
Quantitative assessment of feature quality using clinical concepts across models.
Concept steering experiments identify three distinct operational regimes across clinical concepts in EEG foundation models.
Main empirical finding of the concept steering analysis.
Wrecking-ball interventions that collapse global model performance are empirically identified in EEG foundation models.
Demonstrates a critical failure mode of concept steering with clinical safety implications.
SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers.
Foundational empirical result enabling all downstream analysis.
Concept interventions on some concepts act as 'wrecking-ball' interventions, collapsing global model performance.
Observation of catastrophic performance drop when steering certain concepts.

Claims (10)

The spectral decoder successfully translates latent SAE interventions into physiologically interpretable frequency signatures such as slow-wave suppression and α-band restoration.
Key result linking abstract latent manipulations to known EEG neurophysiology.
EEG foundation models achieve state-of-the-art clinical performance yet their internal computations remain opaque, constituting a barrier to clinical trust.
Motivating claim for the entire paper.
Some SAE concept steering interventions act as 'wrecking balls' that collapse global model performance rather than selectively modifying target concepts.
A critical failure mode identified in the paper demonstrating risk of naïve concept steering.
A single SAE hyperparameter procedure driven by an intrinsic dictionary health audit transfers robustly across all three EEG transformer architectures.
Key methodological contribution claim about architecture-agnostic SAE tuning.
Age and pathology are clinically entangled in EEG foundation model representations such that suppressing one concept inevitably corrupts the other.
A specific representational failure with direct clinical safety implications.
The target vs. off-target probe area metric quantifies steering selectivity and distinguishes selectively steerable from entangled interventions.
Justification for the novel metric introduced in the paper.
Clinical concepts in EEG foundation models fall into three operational regimes: selectively steerable, encoded but entangled, and non-encoded.
Interpretive claim summarizing the spectrum of concept steerability discovered.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement.
Claim that feature grounding enables interpretability metrics.
Spectral decoding of concept interventions can provide physiologically interpretable frequency signatures.
Claim that the spectral decoder adds physiological interpretability.
Age-pathology confounding prevents independent steering of age and pathology concepts.
Interpretive assertion about clinical entanglement in the representations.
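
The spectral claims above can be made concrete with a small band-level summary. The sketch assumes that amplitude spectra before and after an intervention are already available (how the paper's spectral decoder produces them is not described in this summary) and reports the relative change in mean amplitude per frequency band. The band edges are conventional approximations and the function name `band_change` is hypothetical.

```python
import numpy as np

# Conventional EEG bands in Hz; exact edges vary between labs (assumption).
BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 8.0), "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}

def band_change(freqs, amp_before, amp_after, bands=BANDS):
    """Relative change in mean amplitude per band, (after - before) / before."""
    out = {}
    for name, (lo, hi) in bands.items():
        mask = (freqs >= lo) & (freqs < hi)
        before = amp_before[mask].mean()
        out[name] = float((amp_after[mask].mean() - before) / before)
    return out

# Synthetic spectra: halve slow-wave (delta) amplitude, boost alpha by 50%.
freqs = np.arange(0.5, 30.0, 0.5)
amp_before = np.ones_like(freqs)
amp_after = amp_before.copy()
amp_after[freqs < 4.0] *= 0.5
amp_after[(freqs >= 8.0) & (freqs < 13.0)] *= 1.5
change = band_change(freqs, amp_before, amp_after)
```

In this toy case a negative delta change corresponds to slow-wave suppression and a positive alpha change to α-band restoration, the two signatures named in the claims.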

Hypotheses (1)

We hypothesize that applying SAE-based mechanistic interpretability to EEG foundation models can expose representational failures and thereby improve clinical trust.
Overarching motivating hypothesis of the paper.

Questions (6)

What physiologically interpretable frequency signatures correspond to latent concept steering manipulations in EEG foundation models?
Research question motivating the spectral decoder methodology.
Can concept steering interventions on EEG foundation models be made selective rather than globally destructive?
Research question motivating the introduction of the probe area metric and identification of operational regimes.
What clinical concepts are encoded in the internal representations of EEG foundation models?
Primary research question driving the extraction and benchmarking of SAE features.
Can clinical concepts be selectively steered without damaging unrelated performance?
Question about the feasibility of safe concept steering in EEG models.
Are the features extracted by SAEs from EEG transformers monosemantic or entangled?
Research question motivating the monosemanticity and entanglement benchmarking.
How are clinical concepts represented and steerable in EEG foundation models?
Core research question driving the mechanistic investigation.

Original abstract

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers: SleepFM, REVE, and LaBraM to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $α$-band restoration.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders
Mudith Jayasekara, Max Kirkby, Charles O'Neill
2025
≈ 86%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 86%
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang, Xiangchen Song
2025
≈ 85%
What Do EEG Foundation Models Capture from Human Brain Signals?
Qian Chen, Jilin Mei, Houshi Xu, Quanshi Zhang, Jing Shao, Na Zou, Xia Hu, Dongrui Liu, Ling Tang
2026
≈ 85%
Insights into a radiology-specialised multimodal large language model with sparse autoencoders
Shruthi Bannur, Felix Meissen, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland, Kenza Bouzid
2025
≈ 85%
Learning biologically relevant features in a pathology foundation model using sparse autoencoders
Ciyue Shen, Neel Patel, Chintan Shah, Darpan Sanghavi, Blake Martin, Alfred Eng, Daniel Shenker, Harshith Padigela, Raymond Biju, Syed Ashar Javed, Jennifer Hipp, John Abel, Harsha Pokkalla, Sean Grullon, Dinkar Juyal, Nhat Minh Le
2024
≈ 85%
Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction
Feng Dong, Andreas Karwath, Chris Sainsbury
2026
≈ 85%
Mechanistic Interpretability of Antibody Language Models Using SAEs
Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane, Rebonto Haque
2026
≈ 85%
Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder
Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen, Zhen Xu
2025
≈ 84%
Supervised sparse auto-encoders for interpretable and compositional representations
Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao, Ouns El Harzli
2026
≈ 84%
A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models
Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio, Michail Mamalakis
2026
≈ 84%
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du, Dong Shu
2025
≈ 84%
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary and Atticus Geiger
2024
≈ 84%
Constructing Interpretable Features from Compositional Neuron Groups
Atticus Geiger, Mor Geva, Or Shafran
2026
≈ 84%
Measuring and Guiding Monosemanticity
Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting, Ruben Härle
2025
≈ 84%
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Changye Li, Jiaming Ji, Yaodong Yang, Hantao Lou
2025
≈ 84%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 84%
Interpreting Language Model Parameters
in corpus
2026
≈ 83%
Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet
cited
≈ 82%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 82%
Anima Labs Phenomenology Pt1
in corpus
≈ 82%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 81%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 81%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 80%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 80%
Model Alignment Search
in corpus
2025
≈ 80%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 80%
A Mathematical Framework for Transformer Circuits
cited
2021
≈ 76%
Sparse autoencoders find highly interpretable features in language models
cited
2023
≈ 73%
Towards monosemanticity: Decomposing language models with dictionary learning
cited
2023
≈ 67%

+25 more
