concept
active
concept:polysemanticityPolysemanticity
Neurons that respond to multiple unrelated concepts, limiting interpretability.
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- Superposition Hypothesisassociated_withCore theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
Claims (1)
claim
- Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
Concepts (2)
concept
- monosemanticityassociated_withrelated_toInterpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
- Privileged Basisassociated_withA property of activations where neural network features align with basis dimensions due to sparse activation functions; absent in the residual stream but present in tokens, attention patterns, and MLP activations
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- A neuron that responds to multiple unrelated inputs, posing a major challenge for circuit-level interpretation
- Features that correspond to a single semantic concept and are effective for steering behavior.
- Quantitative assessment of feature quality using clinical concepts across models.
- Property of developmental systems where functions are encapsulated in modules with simple triggers, enhancing evolvability.
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023)concept0.723Foundational SAE mechanistic interpretability paper
- Identified gap linking polysemanticity challenge to disentangled representations literature
- Physical plasticity of individual cells or particles enabling adaptation to novel environments.
- Biological architecture where multiple competing/cooperating multi-scale agents develop interpretations of molecular and biophysical states rather than committing to single interpretation.