finding
active
finding:models-with-1-hot-activation-sparsity-still-have-polysemantic-neurons-single-neuron-trained-on-4-mutually-exclusive-features-prefers-polysemantic-representation-with-loss-0-7-vs-0-8
Models with 1-hot activation sparsity still have polysemantic neurons; a single neuron trained on 4 mutually exclusive features prefers a polysemantic representation with loss ~0.7 vs 0.8
Counter-example disproving that architectural sparsity alone can prevent polysemanticity
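The preference for a polysemantic solution can be illustrated with a minimal sketch. The setup here is an assumption chosen for simplicity, not the source's experiment: one hidden neuron with tied encoder/decoder weights `w`, a ReLU output, and squared reconstruction error averaged over 4 one-hot inputs. The names `mean_loss`, `mono`, and `poly` are hypothetical. Because the architecture and loss differ from the source, the values it produces are not the ~0.7 vs 0.8 reported above; only the ordering (polysemantic below monosemantic) is the point.

```python
# Toy setup (an assumption, not the source's experiment):
# 4 mutually exclusive (one-hot) features, one hidden neuron with tied
# weights w, output = ReLU(h * w), squared-error reconstruction loss.

def relu(v):
    return [max(0.0, x) for x in v]

def mean_loss(w):
    """Mean squared reconstruction error over the one-hot inputs."""
    n = len(w)
    total = 0.0
    for i in range(n):
        x = [1.0 if j == i else 0.0 for j in range(n)]
        h = sum(wj * xj for wj, xj in zip(w, x))  # the single hidden neuron
        y = relu([h * wj for wj in w])            # tied-weight decoder
        total += sum((yj - xj) ** 2 for yj, xj in zip(y, x))
    return total / n

mono = [1.0, 0.0, 0.0, 0.0]   # neuron responds to one feature only
poly = [1.0, -1.0, 0.0, 0.0]  # neuron encodes two features with opposite signs

loss_mono = mean_loss(mono)
loss_poly = mean_loss(poly)
print(loss_mono, loss_poly)
```

In this sketch the sign of the single activation separates two features, and the ReLU suppresses the wrong one. The neuron therefore reconstructs two of the four inputs instead of one, even though exactly one feature is active at a time.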
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Author's conclusion after extensive investigation of architectural approaches to monosemanticity
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Precise characterization of why polysemanticity poses a combinatorial obstacle to circuit analysis
- We hypothesize that polysemantic neurons may be resolvable by unfolding networks or training to avoid polysemanticity. · hypothesis · 0.786 · Forward-looking proposal for how the polysemanticity challenge to circuits research might be overcome
- Fundamental assertion: single imperative (free energy minimization) explains diverse cognitive and neural phenomena.
- Identified gap linking polysemanticity challenge to disentangled representations literature
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Systematic comparison showing features are substantially more universal than neurons across models
- Automated interpretability analysis of activations confirms features are more interpretable than neurons
- SAE features are not simply mirroring individual neurons.