finding
active
finding:saes-successfully-extract-sparse-feature-dictionaries-from-embeddings-of-sleepfm-reve-and-labram-eeg-transformers
SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers.
Foundational empirical result enabling all downstream analysis
Source paper
extracted_from(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key methodological contribution claim about architecture-agnostic SAE tuning
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Research question motivating the monosemanticity and entanglement benchmarking
- Quantitative assessment of feature quality using clinical concepts across models.
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
- Standard interpretability approach that VPD critiques and proposes an alternative to.
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Claim that feature grounding enables interpretability metrics.
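
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is only an illustration: the page does not say how entity embeddings are produced or stored, so the function name `related_by_similarity`, the `embeddings` dictionary, and the `typed_edges` set are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def related_by_similarity(query_id, embeddings, typed_edges, threshold=0.65):
    """List entities similar to query_id that have no typed edge to it.

    embeddings:  dict mapping entity id -> vector (hypothetical storage)
    typed_edges: set of (source_id, target_id) pairs (hypothetical storage)
    threshold:   minimum cosine similarity; 0.65 is the cutoff shown on the page
    """
    query_vec = embeddings[query_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == query_id:
            continue
        # Skip entities already linked by a typed relation, in either direction.
        if (query_id, other_id) in typed_edges or (other_id, query_id) in typed_edges:
            continue
        score = cosine(query_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    # Most similar first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data, invented for illustration only.
toy_embeddings = {
    "finding": [1.0, 0.0],
    "near": [1.0, 0.1],
    "far": [0.0, 1.0],
    "linked": [1.0, 0.2],
}
toy_edges = {("finding", "linked")}
neighbors = related_by_similarity("finding", toy_embeddings, toy_edges)
```

In this toy data, `linked` is excluded despite its high similarity because it already has a typed edge, which matches the stated purpose of the list: surfacing candidates for new edges or unrecognized duplicates.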