Monosemantic Functional Features

Features that correspond to a single semantic concept and are effective for steering behavior.

Neighborhood — ranked by edge-count

Internal Features
associated_with
Representations inside LLMs that can be intervened upon.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
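The selection rule above (cosine similarity at or above 0.65, with no typed edge to this entity) can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function name `untyped_neighbors`, the `embeddings` and `typed_edges` structures, and the toy 2-D vectors are all hypothetical. Only the 0.65 threshold comes from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are close to `target`
    in embedding space but share no typed edge with it, highest score first."""
    # Entities already connected to the target by any typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): "A" is the entity being viewed.
embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.1],  # very close to A, no typed edge
    "C": [0.0, 1.0],  # orthogonal to A, falls below the threshold
    "D": [1.0, 0.2],  # close to A, but already linked by a typed edge
    "E": [1.0, 1.0],  # moderately close to A, no typed edge
}
typed_edges = [("A", "D")]

candidates = untyped_neighbors("A", embeddings, typed_edges)
print([name for name, _ in candidates])
```

In this toy data, "D" is excluded despite its high similarity because it already has a typed edge, which is what keeps the list focused on candidates for new edges or unrecognized duplicates.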

monosemanticity · concept · 0.844
Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.

Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, LaBraM. · finding · 0.792
Quantitative assessment of feature quality using clinical concepts across models.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.752
Key paper on scaling SAE-based interpretability to frontier models, cited as precedent.

Polysemanticity · concept · 0.746
Neurons that respond to multiple unrelated concepts, limiting interpretability.

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.743
Foundational SAE mechanistic interpretability paper.

Superposition of Sparse Features · concept · 0.742
Mechanistic finding by Bricken et al. 2023 about how LLMs store features; cited as operational justification for pattern-repository assumption.

SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.741
Claim that feature grounding enables interpretability metrics.

Feature Sparsity · concept · 0.730
Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work.
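
The sparsity property described in the Feature Sparsity entry can be made concrete by measuring, for each feature, the fraction of inputs on which it activates. The sketch below is only an illustration under stated assumptions: the function name `activation_frequency`, the `eps` cutoff, and the toy activation matrix are hypothetical and do not come from any of the cited papers.

```python
def activation_frequency(activations, eps=0.0):
    """For each feature (column), return the fraction of inputs (rows)
    on which its activation exceeds `eps`."""
    n_inputs = len(activations)
    n_features = len(activations[0])
    counts = [0] * n_features
    for row in activations:
        for j, value in enumerate(row):
            if value > eps:
                counts[j] += 1
    return [count / n_inputs for count in counts]


# Toy activations (hypothetical): 4 inputs x 3 features.
activations = [
    [0.0, 0.5, 0.0],
    [0.0, 0.2, 0.0],
    [1.2, 0.1, 0.0],
    [0.0, 0.9, 0.0],
]

frequencies = activation_frequency(activations)
print(frequencies)  # [0.25, 1.0, 0.0]
```

Here the first feature fires on one input in four and is sparse in the sense of the entry, the second fires on every input and is dense, and the third never fires at all.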