finding
active
finding:monosemanticity-and-entanglement-of-sae-features-were-benchmarked-for-clinical-taxonomy-grounding-across-sleepfm-reve-labram
Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM.
Quantitative assessment of feature quality using clinical concepts across models.
Source paper
extracted_from (2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9
Neighborhood — ranked by edge-count
Claims (1)
claim
- Claim that feature grounding enables interpretability metrics.
Communities (3)
community
- Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
- Investigates inseparability of clinical concepts (age, pathology) in EEG transformers using SAE feature analysis and steering metrics across SleepFM, REVE, LaBraM architectures.
- Evaluating sparse autoencoder monosemanticity and entanglement using clinical taxonomy grounding across EEG/sleep foundation models.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Foundational empirical result enabling all downstream analysis
- Research question motivating the monosemanticity and entanglement benchmarking
- Features that correspond to a single semantic concept and are effective for steering behavior.
- Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
- SAE features are not simply mirroring individual neurons.
- Extension of mechanistic interpretability findings to the metacognitive domain
- Result categorizing concept steerability into three distinct regimes.
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
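The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs. The function names, the toy embeddings, and the entity labels are all hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, similarity) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if frozenset((target, name)) in typed_edges:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((name, sim))
    return sorted(hits, key=lambda pair: -pair[1])

# Toy data (hypothetical): 2-D embeddings and one existing typed edge.
embeddings = {
    "finding": [1.0, 0.0],
    "similar_untyped": [0.9, 0.1],   # close to the finding, no typed edge
    "dissimilar": [0.0, 1.0],        # orthogonal, below the threshold
    "similar_typed": [1.0, 0.05],    # close, but already linked
}
typed_edges = {frozenset(("finding", "similar_typed"))}

neighbors = untyped_neighbors("finding", embeddings, typed_edges)
print([name for name, _ in neighbors])
```

Entities returned by such a filter are only candidates: a high cosine score suggests either a missing typed edge or a duplicate record, and distinguishing the two still requires review.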