paper
active
2026
paper:doi-10-48550-arxiv-2605-13930

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

TL;DR

Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models (SleepFM, REVE, and LaBraM) shows that clinical concepts are not always cleanly separable in these models' latent spaces. Age-pathology confounding is the headline example: the two concepts are entangled so that suppressing one corrupts the other. A single hyperparameter procedure guided by an intrinsic dictionary health audit transfers robustly across all three architectures without per-model recalibration. The paper introduces a 'target vs. off-target' probe area metric for concept steering, which quantifies steering selectivity and exposes three regimes: selectively steerable, encoded but entangled, and non-encoded. Some interventions act as 'wrecking-ball' manipulations that collapse global model performance instead of selectively suppressing the target concept. A spectral decoder then maps latent interventions back to physiologically interpretable frequency signatures, including pathological slow-wave suppression and α-band restoration, grounding latent operations in clinically recognizable EEG phenomena. Benchmarked against a clinical taxonomy spanning abnormality, age, sex, and medication, the framework quantifies monosemanticity and entanglement across architectures. Taken together, the results suggest that current EEG foundation models carry embedded clinical confounds that post-hoc steering cannot cleanly separate, which is a barrier to trustworthy deployment in diagnostic settings.

What to take away

  1. TopK Sparse Autoencoders applied to the embeddings of SleepFM, REVE, and LaBraM extract sparse feature dictionaries that can be benchmarked for monosemanticity and entanglement using a four-concept clinical taxonomy: abnormality, age, sex, and medication.
  2. A single SAE hyperparameter selection procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architecturally distinct EEG transformer models without requiring per-architecture re-tuning.
  3. Concept steering experiments reveal three operational regimes (selectively steerable, encoded but entangled, and non-encoded), indicating that not all clinically relevant features can be independently manipulated even when they are detectably represented.
  4. The paper introduces a 'target vs. off-target' probe area metric as a quantitative measure of steering selectivity, enabling comparison of intervention precision across architectures and concepts.
  5. Age-pathology confounding is identified as a structural entanglement: suppressing the age concept in the latent space corrupts the pathology representation, so the two cannot be cleanly disentangled within the current model families.
  6. 'Wrecking-ball' interventions, meaning concept steering operations that collapse global model performance rather than selectively suppressing a target concept, are demonstrated empirically in at least one of the three tested architectures.
  7. A spectral decoder maps latent concept steering interventions back to the amplitude spectrum, producing physiologically interpretable outputs including pathological slow-wave suppression and α-band restoration.
  8. To replicate the dictionary health audit methodology, a researcher would train TopK SAEs on frozen embeddings extracted from each model, then apply the intrinsic audit criterion to select the SAE hyperparameter before any downstream probing or steering.
  9. An open question the paper raises is whether architectural modifications that explicitly enforce disentanglement during EEG foundation model pre-training could eliminate the structural clinical confounds identified here, rather than relying on post-hoc SAE analysis.
  10. The framework benchmarks monosemanticity and entanglement across SleepFM, REVE, and LaBraM using the same clinical taxonomy, providing a cross-architecture comparison of representational quality that did not previously exist for EEG foundation models.
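
The audit step in takeaway 8 can be sketched in a few lines. This entry does not list the paper's actual audit criteria, so the sketch below assumes two common SAE health signals (the fraction of dead features and the fraction of overly dense features) and a hypothetical selection rule that minimizes their sum. The names `dictionary_health`, `select_setting`, and `dense_threshold` are illustrative and not taken from the paper.

```python
from typing import Dict, List


def dictionary_health(codes: List[List[float]],
                      dense_threshold: float = 0.5) -> Dict[str, float]:
    """Summarize dictionary health from a batch of sparse codes.

    codes[i][j] is the activation of feature j on sample i.
    dense_threshold is an assumed cutoff, not a value from the paper.
    """
    n_samples = len(codes)
    n_features = len(codes[0])
    # Firing rate of each feature: fraction of samples where it is nonzero.
    rates = [
        sum(1 for row in codes if row[j] != 0.0) / n_samples
        for j in range(n_features)
    ]
    dead = sum(1 for r in rates if r == 0.0) / n_features
    dense = sum(1 for r in rates if r > dense_threshold) / n_features
    return {"dead_fraction": dead, "dense_fraction": dense}


def select_setting(audits: Dict[int, Dict[str, float]]) -> int:
    """Pick the candidate setting with the smallest dead + dense fraction.

    This selection rule is a hypothetical stand-in for the paper's criterion.
    """
    return min(
        audits,
        key=lambda s: audits[s]["dead_fraction"] + audits[s]["dense_fraction"],
    )


# Toy batch: 3 samples, 4 features.
toy_codes = [
    [1.0, 0.0, 0.0, 2.0],
    [3.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
]
toy_health = dictionary_health(toy_codes)
```

In this toy batch, feature 1 never fires (dead) and feature 0 fires on every sample (dense), so both fractions are 0.25. Because the audit uses only the SAE's own codes and no clinical labels, it is "intrinsic" in the sense the entry describes, and the same function can be run unchanged on codes from any of the three models.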

Peer brief — for seminar discussion

Working across three EEG transformer foundation models — SleepFM, REVE, and LaBraM — this work applies TopK Sparse Autoencoders (SAEs) to extract interpretable feature dictionaries from frozen model embeddings, then probes those dictionaries against a four-concept clinical taxonomy (abnormality, age, sex, medication) using a newly introduced 'target vs. off-target' probe area metric for concept steering. The SAE hyperparameter is selected via an intrinsic dictionary health audit, a procedure that transfers across all three architectures without per-model adjustment, which is non-trivial given how differently SleepFM, REVE, and LaBraM are structured. The load-bearing finding is that clinical concepts in these models are not cleanly disentangled at the representational level. Steering experiments reveal three regimes: selectively steerable concepts that respond to targeted intervention, encoded but entangled concepts where suppression of one inevitably corrupts another, and non-encoded concepts that are absent from the latent space entirely. Age-pathology confounding is the clearest pathological case — it is structurally impossible, within these models, to suppress age representation without corrupting pathology representation. Additionally, some steering attempts produce 'wrecking-ball' interventions that degrade global model performance rather than achieving selective concept suppression. A spectral decoder closes the loop by translating latent interventions into amplitude-spectrum signatures such as slow-wave suppression and α-band restoration, making the analysis legible in physiological terms. The implication is clinical and architectural: EEG foundation models as currently trained carry embedded confounds that are mechanistically inseparable, which poses a direct challenge to their trustworthy deployment in diagnostic pipelines. 
The paper implicitly predicts that pre-training objectives enforcing disentanglement would be required to escape these failure modes, rather than post-hoc interpretability patching. The most contestable aspect is the causal interpretation of steering results. The probe area metric quantifies how selectively a latent direction can be perturbed, but a critical reader would push back on whether concept steering via SAE feature manipulation actually licenses conclusions about the model's internal causal structure, as opposed to reflecting the geometry of the probe's linear readout — a distinction the analysis does not fully resolve. An alternative methodology would have been activation patching or causal tracing (as used in LLM mechanistic interpretability), which would provide a more direct handle on causal mediation but is harder to apply at the scale of continuous EEG embeddings. The scope is also limited to three models from one data modality, leaving open how broadly the three-regime taxonomy generalizes to other biosignal or clinical foundation models.
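
To make the steering discussion concrete, here is a minimal sketch of one common way to steer through an SAE feature: rescale a single feature's activation, decode, and add only the resulting change to the original embedding so that the SAE's reconstruction error is left untouched. Whether the paper steers in exactly this way is not stated in this entry; the function names and the error-preserving update are assumptions.

```python
from typing import List


def decode(z: List[float], w_dec: List[List[float]],
           b_dec: List[float]) -> List[float]:
    """Linear SAE decoder: w_dec has shape (d_model x n_features)."""
    return [
        sum(w * a for w, a in zip(row, z)) + b
        for row, b in zip(w_dec, b_dec)
    ]


def steer_embedding(x: List[float], z: List[float],
                    w_dec: List[List[float]], b_dec: List[float],
                    feature: int, scale: float) -> List[float]:
    """Rescale one SAE feature and apply the decoded change to x.

    scale=0.0 suppresses the feature; scale=1.0 leaves x unchanged.
    """
    z_steered = list(z)
    z_steered[feature] *= scale
    before = decode(z, w_dec, b_dec)
    after = decode(z_steered, w_dec, b_dec)
    # Add only the difference, so the part of x the SAE fails to
    # reconstruct is carried over unchanged.
    return [xi + a - b for xi, a, b in zip(x, after, before)]
```

The steered embedding would then be passed to the frozen downstream probes. Comparing probe outputs before and after is what separates a selective intervention from a 'wrecking-ball' one, and it is also where the critique above applies: the measurement depends on the probes' linear readout.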

Methods (3)

  • Intrinsic Dictionary Health Audit
    A hyperparameter selection procedure driven by intrinsic measures of SAE dictionary quality that transfers across architectures.
  • Spectral Decoder
    Method that maps latent concept steering interventions back to EEG amplitude spectrum to obtain physiologically interpretable frequency signatures.
  • Target vs. Off-Target Probe Area Metric
    Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
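
One plausible reading of the probe area metric is sketched below, assuming that probe accuracy is recorded at several steering strengths and that "area" means the integrated drop in accuracy relative to the unsteered baseline. The paper's exact definition is not given in this entry, so `probe_areas`, the trapezoidal integration, and the example numbers are all illustrative assumptions.

```python
from typing import List, Tuple


def trapezoid_area(xs: List[float], ys: List[float]) -> float:
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


def probe_areas(strengths: List[float], target_acc: List[float],
                off_target_acc: List[float]) -> Tuple[float, float]:
    """Return (target_area, off_target_area) of probe-accuracy drop.

    The first entry of each accuracy list is taken as the unsteered
    baseline (steering strength zero).
    """
    target_drop = [target_acc[0] - a for a in target_acc]
    off_drop = [off_target_acc[0] - a for a in off_target_acc]
    return (trapezoid_area(strengths, target_drop),
            trapezoid_area(strengths, off_drop))


# Made-up accuracies for illustration only.
strengths = [0.0, 1.0, 2.0]
target_area, off_area = probe_areas(
    strengths,
    target_acc=[1.0, 0.75, 0.5],         # target probe degrades steadily
    off_target_acc=[0.75, 0.75, 0.625],  # off-target probe barely moves
)
```

Under this reading, a large target area with a small off-target area corresponds to the selectively steerable regime, comparable areas to encoded but entangled, and a near-zero target area to non-encoded.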

Frameworks (4)

  • LaBraM
    EEG transformer foundation model for brain activity analysis, one of the three architectures studied.
  • REVE
    EEG transformer foundation model (representation model) analyzed in the study.
  • SleepFM
    EEG transformer foundation model for sleep staging, one of the three analyzed architectures.
  • TopK Sparse Autoencoders
    The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries.
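
For readers unfamiliar with TopK SAEs, the forward pass can be sketched in plain Python. This follows the standard TopK formulation (keep the k largest encoder pre-activations, zero the rest, decode linearly). The specific layer shapes, the pre-encoder bias subtraction, and the toy weights are assumptions for illustration, not details confirmed by this entry.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def topk_activation(pre: Vector, k: int) -> Vector:
    """Keep the k largest pre-activations (clamped at zero), zero the rest."""
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    out = [0.0] * len(pre)
    for i in keep:
        out[i] = max(pre[i], 0.0)
    return out


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector,
               b_dec: Vector, k: int) -> Vector:
    """Sparse code z = TopK(W_enc (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return topk_activation(pre, k)


def sae_decode(z: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction x_hat = W_dec z + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, z), b_dec)]


# Toy weights: 2-dimensional embedding, 3 dictionary features, k = 2.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # (n_features x d_model)
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # (d_model x n_features)
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

z = sae_encode([1.0, 2.0], w_enc, b_enc, b_dec, k=2)  # [0.0, 2.0, 3.0]
x_hat = sae_decode(z, w_dec, b_dec)                   # [1.5, 3.5]
```

The value of k fixes the number of active features per embedding directly, which is why TopK SAEs need no separate sparsity penalty. In the study's setting the input would be a frozen embedding from SleepFM, REVE, or LaBraM rather than a toy vector.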

Original abstract

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers: SleepFM, REVE, and LaBraM to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $α$-band restoration.

Cross-corpus bridges (1)

same_concept_as · Nomic cosine

External markdown files that talk about the same concept as this entity.

  • alexander
    Agent: APPLIED Extraction · tmp/agent-applied-2026-05-09.md · 0.763