event
active
event:mechanistic-interpretability-of-eeg-foundation-models-via-sparse-autoencoders-2026

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2026)

Preprint that applies TopK sparse autoencoders (SAEs) to three EEG transformers (LaBraM, REVE, SleepFM) to extract sparse feature dictionaries, characterize steering regimes, and interpret steering effects in the frequency domain.

Neighborhood — ranked by edge-count

Concepts (12)

concept
  • Entanglement
    mentions
    Property of learned features in which clinical concepts share latent directions, so that steering one concept also shifts others; benchmarked alongside monosemanticity.
  • Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
  • Large transformer models pretrained on EEG data for clinical tasks; the object of interpretability in this paper.
  • Concept-steering regime in which the intervention catastrophically collapses global model performance.
  • Entanglement phenomenon where age and pathology concepts cannot be independently steered without corrupting each other.
  • Set of clinical concepts used as a grounding vocabulary to benchmark SAE feature monosemanticity and entanglement.
  • EEG frequency signature of reduced slow-wave activity, obtained as a spectral interpretation of steering.
  • Restoration of alpha band (8–12 Hz) power in EEG, a physiological signature obtained from spectral decoding.
  • abnormality
    mentions
    EEG abnormality concept (e.g., epileptiform activity) used to interpret SAE features.
  • age
    mentions
    Patient age concept used to interpret SAE features.
  • medication
    mentions
    Medication status concept used to interpret SAE features.
  • sex
    mentions
    Biological sex concept used to interpret SAE features.
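
The two spectral signatures listed above (alpha-band power at 8–12 Hz and slow-wave activity) can be illustrated with a plain periodogram. This is a minimal sketch on a synthetic signal, not the paper's decoding pipeline: the function name `band_power`, the 256 Hz sampling rate, and the 0.5–4 Hz slow-wave band are assumptions made for illustration only.

```python
import numpy as np

def band_power(signal, fs, lo, hi):
    """Integrate a simple periodogram between lo and hi (Hz)."""
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    # Unnormalized one-sided periodogram; only power ratios are used below.
    psd = np.abs(np.fft.rfft(signal)) ** 2 / (fs * signal.size)
    mask = (freqs >= lo) & (freqs <= hi)
    return psd[mask].sum() * (freqs[1] - freqs[0])

fs = 256.0                              # assumed sampling rate in Hz
t = np.arange(0, 4.0, 1.0 / fs)         # 4 s of synthetic data
# Synthetic trace: strong 10 Hz (alpha) plus weaker 2 Hz (slow-wave) component.
eeg = 1.0 * np.sin(2 * np.pi * 10 * t) + 0.3 * np.sin(2 * np.pi * 2 * t)

alpha_power = band_power(eeg, fs, 8.0, 12.0)   # alpha band as given in the text
slow_power = band_power(eeg, fs, 0.5, 4.0)     # assumed slow-wave (delta) band
alpha_to_slow = alpha_power / slow_power
```

Under these assumptions, a steering intervention that restores alpha and reduces slow-wave activity would show up as an increase in `alpha_to_slow` between the original and the steered signal.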

Methods (5)

method
  • Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
  • Method that maps latent concept steering interventions back to the EEG amplitude spectrum to obtain physiologically interpretable frequency signatures.
  • Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
  • Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
  • Intrinsic hyperparameter selection procedure based on dictionary quality metrics; introduced in this paper to transfer across architectures.
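
The sparse dictionary learning and latent steering methods listed above can be sketched together. The code below is a hedged illustration of a generic TopK SAE forward pass and a simple additive steering rule, using random weights in place of a trained dictionary; all names (`topk_sae_encode`, `sae_decode`, `steer`), the dimensions, and the additive steering rule are assumptions, not the paper's implementation.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k):
    """Sparse codes that keep only the k largest pre-activations per row."""
    pre = x @ W_enc + b_enc                          # (batch, n_features)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]   # top-k indices per row
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)   # ReLU on the kept entries
    return z

def sae_decode(z, W_dec, b_dec):
    """Linear decoder: reconstruct embeddings from sparse codes."""
    return z @ W_dec + b_dec

def steer(x, W_enc, b_enc, W_dec, b_dec, k, feature, alpha):
    """Hypothetical steering rule: add alpha to one latent feature.

    The SAE reconstruction error is preserved by adding only the change
    in the decoded output back onto the original embedding.
    """
    z = topk_sae_encode(x, W_enc, b_enc, k)
    z_steered = z.copy()
    z_steered[:, feature] += alpha
    delta = sae_decode(z_steered, W_dec, b_dec) - sae_decode(z, W_dec, b_dec)
    return x + delta

# Toy setup with random (untrained) weights, for shape checking only.
rng = np.random.default_rng(0)
d, m, k = 16, 64, 4                     # embedding dim, dictionary size, TopK
x = rng.normal(size=(8, d))             # stand-in for transformer embeddings
W_enc, b_enc = rng.normal(size=(d, m)), np.zeros(m)
W_dec, b_dec = rng.normal(size=(m, d)), np.zeros(d)

z = topk_sae_encode(x, W_enc, b_enc, k)
feature, alpha = 5, 2.0
x_steered = steer(x, W_enc, b_enc, W_dec, b_dec, k, feature, alpha)
```

With this additive rule, the steered embedding differs from the original by exactly `alpha * W_dec[feature]`, which makes the intervention easy to reason about; other steering schemes (clamping or rescaling a feature) would behave differently.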

Frameworks (3)

framework
  • LaBraM
    mentions
    EEG transformer foundation model for brain activity analysis, one of the three architectures studied.
  • REVE
    mentions
    EEG transformer foundation model (representation model) analyzed in the study.
  • SleepFM
    mentions
    EEG transformer foundation model for sleep staging, one of the three analyzed architectures.