finding

active

finding:models-with-1-hot-activation-sparsity-still-have-polysemantic-neurons-single-neuron-trained-on-4-mutually-exclusive-features-prefers-polysemantic-representation-with-loss-0-7-vs-0-8

Models with 1-hot activation sparsity still have polysemantic neurons: a single neuron trained on 4 mutually exclusive features prefers a polysemantic representation, with loss ~0.7 versus ~0.8 for a monosemantic one

Counter-example disproving that architectural sparsity alone can prevent polysemanticity
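
The two loss values can be reproduced with a short calculation. The sketch below is one reading that matches the quoted numbers, not necessarily the source's exact setup. It assumes the four features are equally likely, that the single neuron is effectively binary (it can only split the features into an "on" group and an "off" group), and that within each group the best available prediction is uniform. The function name `expected_cross_entropy` and the feature labels A to D are hypothetical.

```python
import math

def expected_cross_entropy(groups):
    """Expected cross-entropy (in nats) when features in the same group
    are indistinguishable to the model and all features are equally likely.

    Assumption: within a group of size k the model predicts uniformly,
    which costs ln(k) nats on every sample drawn from that group.
    """
    n_features = sum(len(group) for group in groups)
    return sum(
        (len(group) / n_features) * math.log(len(group))
        for group in groups
    )

# Monosemantic neuron: fires only for A, so B, C, D are lumped together.
mono_loss = expected_cross_entropy([["A"], ["B", "C", "D"]])

# Polysemantic neuron: fires for "A or B", leaving C and D as the other group.
poly_loss = expected_cross_entropy([["A", "B"], ["C", "D"]])

print(f"monosemantic: {mono_loss:.3f}")  # 0.75 * ln(3)
print(f"polysemantic: {poly_loss:.3f}")  # ln(2)
```

Under these assumptions the monosemantic split costs (3/4)·ln 3 ≈ 0.82 nats and the polysemantic split costs ln 2 ≈ 0.69 nats, which is consistent with the ~0.8 and ~0.7 figures above: the even two-way split removes more uncertainty than isolating a single feature.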

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Training models with sparse activations cannot fully prevent polysemanticity because cross-entropy loss creates incentives for polysemantic neurons even without superposition
supports
Author's conclusion after extensive investigation of architectural approaches to monosemanticity

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Polysemantic neurons are a major challenge for the circuits agenda, because N meanings in one neuron times M in another creates N×M effective connections that cannot be considered individually. · claim · 0.792
Precise characterization of why polysemanticity poses a combinatorial obstacle to circuit analysis
We hypothesize that polysemantic neurons may be resolvable by unfolding networks or training to avoid polysemanticity. · hypothesis · 0.786
Forward-looking proposal for how the polysemanticity challenge to circuits research might be overcome
All neuronal processing and action selection minimize variational free energy, unifying perception, action, and learning. · claim · 0.784
Fundamental assertion: single imperative (free energy minimization) explains diverse cognitive and neural phenomena.
No established method for resolving polysemantic neurons into pure features at scale · question · 0.783
Identified gap linking polysemanticity challenge to disentangled representations literature
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.782
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.774
Systematic comparison showing features are substantially more universal than neurons across models
Claude achieves significantly higher Spearman correlation predicting feature activations vs neuron activations · finding · 0.766
Automated interpretability analysis of activations confirms features are more interpretable than neurons
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. · finding · 0.757
SAE features are not simply mirroring individual neurons.