event
active
event:mechanistic-interpretability-of-eeg-foundation-models-via-sparse-autoencoders-2026
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2026)
Preprint applying TopK SAEs to three EEG transformers to reveal sparse feature dictionaries, steering regimes, and spectral interpretation.
Neighborhood — ranked by edge-count
Thinkers (12)
- James Zou (authored)
- Lars Kai Hansen (authored)
- Magnus Ruud Kjær (authored)
- Nick Williams (authored)
- Radu Gatej (authored)
- Rahul Thapa (authored)
- Sadasivan Puthusserypady (authored)
- Sándor Beniczky (authored)
- Tue Lehn-Schiøler (authored)
- William Lehn-Schiøler (authored)
- Anton Mosquera Storgaard (authored)
- Magnus Guldberg Pedersen (authored)
Concepts (12)
- Entanglement (mentions): Less hierarchical than embedment; multiple texts work into and out of each other, creating associations across levels and connecting any single text to the matrix of all others.
- monosemanticity (mentions): Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
- EEG foundation models (mentions): Large transformer models pretrained on EEG data for clinical tasks; the object of interpretability in this paper.
- wrecking-ball intervention (mentions): Type of concept steering intervention that catastrophically collapses global model performance.
- age-pathology confounding (mentions): Entanglement phenomenon where age and pathology concepts cannot be independently steered without corrupting each other.
- Set of clinical concepts used as a grounding vocabulary to benchmark SAE feature monosemanticity and entanglement.
- EEG frequency signature of reduced slow-wave activity, obtained as a spectral interpretation of steering.
- α-band restoration (mentions): Restoration of alpha band (8–12 Hz) power in EEG, a physiological signature obtained from spectral decoding.
- abnormality (mentions): EEG abnormality concept (e.g., epileptiform activity) used to interpret SAE features.
- age (mentions): Patient age concept used to interpret SAE features.
- medication (mentions): Medication status concept used to interpret SAE features.
- sex (mentions): Biological sex concept used to interpret SAE features.
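The spectral concepts listed above (alpha-band power in 8–12 Hz, slow-wave activity) can be made concrete with a small band-power computation. This is a minimal sketch on a synthetic signal, not the paper's spectral decoder: the function name `band_power`, the sampling rate, the 0.5–4 Hz slow-wave range, and the test signal are all assumptions chosen for illustration.

```python
import numpy as np

def band_power(signal, fs, lo, hi):
    """Sum of squared FFT amplitudes for frequencies in [lo, hi] Hz.

    Hypothetical helper for illustration; `fs` is the sampling rate in Hz.
    """
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    amp = np.abs(np.fft.rfft(signal)) / signal.size  # amplitude spectrum
    mask = (freqs >= lo) & (freqs <= hi)
    return float(np.sum(amp[mask] ** 2))

# Synthetic "EEG": a 10 Hz alpha rhythm plus a weaker 2 Hz slow wave (assumed values).
fs = 256
t = np.arange(0, 4.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 2 * t)

alpha = band_power(signal, fs, 8.0, 12.0)   # alpha band as defined in the entry above
slow = band_power(signal, fs, 0.5, 4.0)     # assumed slow-wave range
```

Comparing `alpha` and `slow` before and after an intervention is one simple way to read a change as "alpha restoration" or "reduced slow-wave activity".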
Methods (5)
- Concept Steering (mentions): Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
- Spectral Decoder (introduces): Method that maps latent concept steering interventions back to EEG amplitude spectrum to obtain physiologically interpretable frequency signatures.
- TopK Sparse Autoencoders (SAEs) (mentions): Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
- Target vs. Off-Target Probe Area Metric (introduces): Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
- Dictionary Health Audit (introduces): Intrinsic hyperparameter selection procedure based on dictionary quality metrics; introduced in this paper to transfer across architectures.
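To illustrate how a TopK SAE and concept steering fit together, here is a minimal NumPy sketch. It assumes untrained random weights, a single linear encoder and decoder, and steering by clamping one latent to a fixed value; the names `topk_sae_encode`, `sae_decode`, and `steer` and all dimensions are hypothetical, and the paper's actual architecture and intervention may differ.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k):
    """Keep the k largest pre-activations per row and zero out the rest."""
    pre = x @ W_enc + b_enc
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return np.maximum(z, 0.0)  # ReLU so surviving features are non-negative

def sae_decode(z, W_dec, b_dec):
    """Reconstruct the embedding as a sparse combination of dictionary rows."""
    return z @ W_dec + b_dec

def steer(z, feature, value):
    """Clamp one latent feature to `value` (one simple form of steering)."""
    z = z.copy()
    z[..., feature] = value
    return z

rng = np.random.default_rng(0)
d_model, d_dict, k = 16, 64, 4            # assumed sizes for illustration
W_enc = rng.normal(size=(d_model, d_dict)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_dict, d_model)) / np.sqrt(d_dict)
b_enc, b_dec = np.zeros(d_dict), np.zeros(d_model)

x = rng.normal(size=(8, d_model))         # stand-in for transformer embeddings
z = topk_sae_encode(x, W_enc, b_enc, k)
x_hat = sae_decode(z, W_dec, b_dec)
x_steered = sae_decode(steer(z, feature=3, value=5.0), W_dec, b_dec)
```

In this sketch the steered reconstruction differs from the unsteered one only along the decoder direction `W_dec[3]`, which is what makes a single-feature intervention easy to map back to input space, for example through a spectral readout.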
Frameworks (3)
- LaBraM (mentions): EEG transformer foundation model for brain activity analysis, one of the three architectures studied.
- REVE (mentions): EEG transformer foundation model (representation model) analyzed in the study.
- SleepFM (mentions): EEG transformer foundation model for sleep staging, one of the three analyzed architectures.