claim
active
claim:training-models-with-sparse-activations-cannot-fully-prevent-polysemanticity-because-cross-entropy-loss-creates-incentives-for-polysemantic-neurons-even-without-superposition
Training models with sparse activations cannot fully prevent polysemanticity because cross-entropy loss creates incentives for polysemantic neurons even without superposition
Author's conclusion after extensive investigation of architectural approaches to monosemanticity
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Findings (1)
finding
- Counter-example disproving that architectural sparsity alone can prevent polysemanticity
Frameworks (1)
framework
- SoLU Activation Function · extends · Prior Anthropic approach to increasing neuron monosemanticity via activation function design; found to make some neurons more interpretable at the cost of others
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- We hypothesize that polysemantic neurons may be resolvable by unfolding networks or training to avoid polysemanticity. · hypothesis · 0.807 · Forward-looking proposal for how the polysemanticity challenge to circuits research might be overcome
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Critique of activation-based interpretability methods.
- Fundamental assertion: single imperative (free energy minimization) explains diverse cognitive and neural phenomena.
- Identified gap linking polysemanticity challenge to disentangled representations literature
- Precise characterization of why polysemanticity poses a combinatorial obstacle to circuit analysis
- Key limitation acknowledged by authors.