claim
active
claim:sae-features-can-be-grounded-in-clinical-taxonomy-abnormality-age-sex-medication-to-benchmark-monosemanticity-and-entanglement
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement.
Claims that grounding SAE features in clinical concepts enables quantitative interpretability metrics, namely monosemanticity and entanglement.
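To make the idea of grounding concrete, here is a minimal sketch of how SAE feature activations could be scored against clinical labels. It is not the source paper's metric: the function names (`concept_association`, `monosemanticity`, `entanglement`), the use of absolute Pearson correlation, and the 0.3 threshold are all assumptions chosen for illustration.

```python
import numpy as np

def concept_association(acts, labels):
    """Absolute Pearson correlation between each feature and each concept.

    acts:   (n_samples, n_features) SAE feature activations
    labels: (n_samples, n_concepts) clinical labels, e.g. abnormality, age, sex, medication
    returns (n_features, n_concepts)
    """
    a = acts - acts.mean(axis=0)
    c = labels - labels.mean(axis=0)
    cov = a.T @ c
    denom = np.outer(np.linalg.norm(a, axis=0), np.linalg.norm(c, axis=0))
    # Constant columns have zero norm; report zero association for them.
    return np.abs(np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0))

def monosemanticity(assoc):
    """Assumed score: share of a feature's total association held by its top concept."""
    total = assoc.sum(axis=1)
    top = assoc.max(axis=1)
    return np.divide(top, total, out=np.zeros_like(top), where=total > 0)

def entanglement(assoc, threshold=0.3):
    """Assumed score: number of concepts a feature tracks at or above `threshold`."""
    return (assoc >= threshold).sum(axis=1)

# Toy data: two uncorrelated binary concepts over four samples.
labels = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
acts = np.column_stack([
    labels[:, 0],                 # feature 0 follows concept 0 only
    labels[:, 0] + labels[:, 1],  # feature 1 mixes both concepts
])
assoc = concept_association(acts, labels)
mono = monosemanticity(assoc)
ent = entanglement(assoc)
```

Under these assumed definitions, a feature that follows a single concept scores close to 1 on monosemanticity, while a feature that mixes two concepts equally scores close to 0.5 and counts as entangled with both.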
Source paper
extracted_from · (2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9
Neighborhood — ranked by edge-count
Findings (1)
finding
- Quantitative assessment of feature quality using clinical concepts across models.
Communities (3)
community
- Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
- Investigates inseparability of clinical concepts (age, pathology) in EEG transformers using SAE feature analysis and steering metrics across SleepFM, REVE, LaBraM architectures.
- Evaluating sparse autoencoder monosemanticity and entanglement using clinical taxonomy grounding across EEG/sleep foundation models.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Extension of mechanistic interpretability findings to the metacognitive domain
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
- Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
- Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
- Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
- Overarching motivating hypothesis of the paper
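The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the graph's actual implementation: the function name `related_without_edge`, the dictionary of embeddings, the set of typed edges, and the toy vectors are all hypothetical, and embeddings are assumed to be non-zero.

```python
import numpy as np

def related_without_edge(query_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, cosine) pairs at or above `threshold` that have
    no typed edge to `query_id`, most similar first.

    embeddings:  dict mapping entity id -> non-zero 1-D numpy vector (assumed)
    typed_edges: set of (id_a, id_b) pairs, treated as undirected (assumed)
    """
    q = embeddings[query_id]
    q = q / np.linalg.norm(q)
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == query_id:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (query_id, other_id) in typed_edges or (other_id, query_id) in typed_edges:
            continue
        sim = float(q @ (vec / np.linalg.norm(vec)))
        if sim >= threshold:
            hits.append((other_id, sim))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings for four entities.
embeddings = {
    "claim": np.array([1.0, 0.0]),
    "near": np.array([1.0, 0.1]),    # high cosine, no typed edge
    "far": np.array([0.0, 1.0]),     # cosine 0, below threshold
    "linked": np.array([1.0, 1.0]),  # cosine ~0.71, but already has a typed edge
}
typed_edges = {("claim", "linked")}
candidates = related_without_edge("claim", embeddings, typed_edges)
all_similar = related_without_edge("claim", embeddings, set())
```

Entities returned this way are only candidates; whether each one deserves a new typed edge or is a duplicate still needs review.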