event

active

event:mechanistic-interpretability-of-eeg-foundation-models-via-sparse-autoencoders-2026

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2026)

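Preprint that applies TopK sparse autoencoders (SAEs) to three EEG transformer foundation models (LaBraM, REVE, SleepFM), yielding sparse feature dictionaries, distinct steering regimes, and spectral interpretations of steering interventions.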

Neighborhood — ranked by edge-count

Thinkers (12)

thinker

Concepts (12)

concept

Entanglement
mentions
Less hierarchical than embedment; multiple texts work into and out of each other, creating associations across levels and connecting any single text to the matrix of all others.
monosemanticity
mentions
Interpretability property in which a latent feature represents a single semantic concept; benchmarked in this paper across architectures.
EEG foundation models
mentions
Large transformer models pretrained on EEG data for clinical tasks; the object of interpretability in this paper.
wrecking-ball intervention
mentions
Type of concept steering intervention that catastrophically collapses global model performance.
age-pathology confounding
mentions
Entanglement phenomenon where age and pathology concepts cannot be independently steered without corrupting each other.
Clinical Taxonomy: abnormality, age, sex, medication
mentions
Set of clinical concepts used as a grounding vocabulary to benchmark SAE feature monosemanticity and entanglement.
pathological slow-wave suppression
mentions
EEG frequency signature of reduced slow-wave activity, obtained as a spectral interpretation of steering.
α-band restoration
mentions
Restoration of alpha-band (8–12 Hz) power in EEG; a physiological signature obtained from spectral decoding.
abnormality
mentions
EEG abnormality concept (e.g., epileptiform activity) used to interpret SAE features.
age
mentions
Patient age concept used to interpret SAE features.
medication
mentions
Medication status concept used to interpret SAE features.
sex
mentions
Biological sex concept used to interpret SAE features.
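
To make the α-band entry above concrete, here is a minimal sketch of how amplitude in the 8–12 Hz band could be measured for a single-channel signal. It is not the paper's Spectral Decoder; the function name `band_amplitude`, the sampling rate, and the synthetic test signals are assumptions made for illustration only.

```python
import numpy as np

def band_amplitude(signal, fs, lo=8.0, hi=12.0):
    """Mean amplitude-spectrum value in [lo, hi] Hz for a 1-D signal sampled at fs Hz."""
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)   # frequency of each FFT bin
    amp = np.abs(np.fft.rfft(signal)) / signal.size    # amplitude spectrum (scaled by length)
    mask = (freqs >= lo) & (freqs <= hi)               # bins inside the band
    return amp[mask].mean()

# Hypothetical synthetic signals: 4 s at 256 Hz.
fs = 256
t = np.arange(0, 4, 1 / fs)
alpha_like = np.sin(2 * np.pi * 10 * t)   # 10 Hz, inside the alpha band
slow_wave = np.sin(2 * np.pi * 2 * t)     # 2 Hz, a slow-wave frequency

print(band_amplitude(alpha_like, fs) > band_amplitude(slow_wave, fs))
```

Comparing this quantity before and after an intervention is one simple way to read "restoration" or "suppression" off a spectrum; the paper's own decoding procedure may differ.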

Methods (5)

method

Concept Steering
mentions
Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
Spectral Decoder
introduces
Method that maps latent concept-steering interventions back to the EEG amplitude spectrum to obtain physiologically interpretable frequency signatures.
TopK Sparse Autoencoders (SAEs)
mentions
Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
Target vs. Off-Target Probe Area Metric
introduces
Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
Dictionary Health Audit
introduces
Intrinsic hyperparameter-selection procedure based on dictionary quality metrics; introduced in this paper and intended to transfer across architectures.
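
As an illustration of the TopK SAE and Concept Steering entries above, the following is a minimal NumPy sketch under stated assumptions, not the paper's implementation. The weight names (`W_enc`, `W_dec`, `b_enc`, `b_dec`), the shapes, and the steering rule (add a constant `alpha` to one feature and add the decoded difference back to the embedding) are all hypothetical choices; the weights here are random rather than trained.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, b_dec, k):
    """Sparse codes: keep the k largest pre-activations per row, zero the rest."""
    pre = (x - b_dec) @ W_enc + b_enc                  # (batch, n_features)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]     # indices of the top-k per row
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)     # ReLU on the kept activations
    return z

def sae_decode(z, W_dec, b_dec):
    """Reconstruct embeddings from sparse codes."""
    return z @ W_dec + b_dec

def steer(x, feature, alpha, W_enc, b_enc, W_dec, b_dec, k):
    """Shift one SAE feature by alpha and apply the decoded change to x."""
    z = topk_sae_encode(x, W_enc, b_enc, b_dec, k)
    z_steered = z.copy()
    z_steered[:, feature] += alpha
    # Add only the decoded difference, so SAE reconstruction error is not injected.
    return x + sae_decode(z_steered, W_dec, b_dec) - sae_decode(z, W_dec, b_dec)

# Hypothetical sizes: 16-dim embeddings, 64 dictionary features, k = 4.
rng = np.random.default_rng(0)
d, n, k = 16, 64, 4
x = rng.normal(size=(8, d))
W_enc, W_dec = rng.normal(size=(d, n)), rng.normal(size=(n, d))
b_enc, b_dec = np.zeros(n), np.zeros(d)

z = topk_sae_encode(x, W_enc, b_enc, b_dec, k)
x_steered = steer(x, feature=3, alpha=2.0, W_enc=W_enc, b_enc=b_enc,
                  W_dec=W_dec, b_dec=b_dec, k=k)
print(np.count_nonzero(z, axis=1))   # at most k active features per row
```

With this rule the edit to each embedding equals `alpha * W_dec[feature]`, i.e. a move along one dictionary direction. Large `alpha` values are where one would expect off-target effects of the kind the wrecking-ball entry describes, though the thresholds are model-specific and not given here.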

Frameworks (3)

framework

LaBraM
mentions
EEG transformer foundation model for brain activity analysis, one of the three architectures studied.
REVE
mentions
EEG transformer foundation model (a representation model), one of the three architectures analyzed in the study.
SleepFM
mentions
EEG transformer foundation model for sleep staging, one of the three analyzed architectures.