concept
active
concept:monosemantic-functional-features
Monosemantic Functional Features
Features that correspond to a single semantic concept and are effective for steering behavior.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Internal Features (associated_with): Representations inside LLMs that can be intervened upon.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
- Quantitative assessment of feature quality using clinical concepts across models.
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.752 · Key paper on scaling SAE-based interpretability to frontier models, cited as precedent.
- Neurons that respond to multiple unrelated concepts, limiting interpretability.
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.743 · Foundational SAE mechanistic interpretability paper.
- Mechanistic finding by Bricken et al. 2023 about how LLMs store features; cited as operational justification for pattern-repository assumption
- Claim that feature grounding enables interpretability metrics.
- Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
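The similarity listing above keeps entities whose cosine score is at least 0.65 and that have no typed edge to this concept. A minimal sketch of that filter follows, assuming each entity has an embedding vector. The function names, the toy two-dimensional vectors, and the `typed_edges` set are hypothetical illustrations, not this page's actual pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that lack a typed edge."""
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed relation.
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): "linked" is similar but already has a typed edge.
embeddings = {
    "target": [1.0, 0.0],
    "near": [0.9, 0.1],
    "linked": [1.0, 0.05],
    "far": [0.0, 1.0],
}
typed_edges = {"linked"}
candidates = similar_without_edge("target", embeddings, typed_edges)
```

Here `candidates` contains only `near`: `linked` is excluded because it already has a typed edge, and `far` falls below the threshold.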