concept
active
concept:monosemanticity
monosemanticity
Interpretability property in which a latent feature represents a single semantic concept; benchmarked across architectures.
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- TopK Sparse Autoencoders · implements · The central mechanistic interpretability tool, applied across all three EEG transformers to extract sparse feature dictionaries.
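As a rough illustration of the TopK sparsity step that gives these autoencoders their name, the sketch below keeps the k largest latent pre-activations and zeroes the rest. The function name `topk_activation` and the toy values are hypothetical and are not taken from the preprint; the encoder, decoder, and training loop are omitted.

```python
from typing import List


def topk_activation(pre_acts: List[float], k: int) -> List[float]:
    """Keep the k largest pre-activations and set all others to zero."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    # Indices of the k largest values; ties go to the lower index.
    order = sorted(range(len(pre_acts)), key=lambda i: (-pre_acts[i], i))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


# Toy latent pre-activations (made up for illustration).
latents = topk_activation([0.1, 2.0, -0.5, 1.3, 0.7], k=2)
print(latents)  # [0.0, 2.0, 0.0, 1.3, 0.0]
```

With at most k non-zero latents per input, each active latent can be inspected individually, which is what makes a monosemanticity assessment per feature feasible.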
Concepts (3)
concept
- Polysemanticity · associated_with, related_to · Neurons that respond to multiple unrelated concepts, limiting interpretability.
- Clinical Taxonomy · associated_with · The grounding schema, comprising abnormality, age, sex, and medication, used to interpret SAE features.
- Feature splitting · associated_with · Phenomenon where a feature in a small SAE splits into multiple finer features in a larger SAE.
Events (1)
event
- Preprint applying TopK SAEs to three EEG transformers to reveal sparse feature dictionaries, steering regimes, and spectral interpretation.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Features that correspond to a single semantic concept and are effective for steering behavior.
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.803 · Foundational SAE mechanistic interpretability paper.
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.799 · Key paper on scaling SAE-based interpretability to frontier models, cited as precedent.
- Quantitative assessment of feature quality using clinical concepts across models.
- Standard algebraic abstraction with identity (ε) and associative binary operation ('); used to specify Image overlay operations.
- Inherent in Linda because an in statement chooses one matching tuple arbitrarily; essential for many parallel patterns.
- Property of developmental systems where functions are encapsulated in modules with simple triggers, enhancing evolvability.
- Kuhn's concept: the inability of ideas from one paradigm to be translated into the terms of another, causing communication breakdowns.
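The "cosine ≥ 0.65 · no typed edge" rule used for this list can be sketched as a simple filter: score every other entity by cosine similarity to this one, drop those already connected by a typed edge, and keep the rest at or above the threshold. The function names, entity keys, and two-dimensional embeddings below are made up for illustration; how the graph actually computes its embeddings is not stated here.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_candidates(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities similar to `target` that have no typed edge to it yet."""
    scored = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and already-linked neighbors
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    # Highest similarity first.
    return sorted(scored, key=lambda pair: -pair[1])


# Hypothetical toy embeddings, not real graph data.
embeddings = {
    "monosemanticity": [1.0, 0.0],
    "polysemanticity": [0.9, 0.1],          # already has a typed edge
    "scaling_monosemanticity": [0.8, 0.6],  # similar, no typed edge
    "monoid": [0.0, 1.0],                   # orthogonal, below threshold
}
candidates = similarity_candidates(
    "monosemanticity", embeddings, typed_neighbors={"polysemanticity"}
)
print(candidates)
```

A purely embedding-based threshold like this can surface entries that are lexically close but topically unrelated, so each candidate still needs manual review before it is promoted to a typed edge or merged as a duplicate.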