claim

active

claim:training-models-with-sparse-activations-cannot-fully-prevent-polysemanticity-because-cross-entropy-loss-creates-incentives-for-polysemantic-neurons-even-without-superposition

Training models with sparse activations cannot fully prevent polysemanticity because cross-entropy loss creates incentives for polysemantic neurons even without superposition

Author's conclusion after extensive investigation of architectural approaches to monosemanticity

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Findings (1)

finding

Models with 1-hot activation sparsity still have polysemantic neurons: a single neuron trained on 4 mutually exclusive features prefers a polysemantic representation, with loss ~0.7 versus ~0.8 for the monosemantic alternative
supports
Counter-example disproving that architectural sparsity alone can prevent polysemanticity
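
A back-of-envelope calculation shows where losses of roughly 0.7 and 0.8 could come from. The sketch below is not taken from the source paper. It assumes four equally likely, mutually exclusive features, a single neuron that is either on or off, and an ideal readout that spreads its prediction uniformly over the features the neuron cannot tell apart. The function name `best_cross_entropy` and the feature labels are hypothetical.

```python
import math

def best_cross_entropy(groups):
    """Lowest achievable cross-entropy (in nats) when the readout can only
    tell which group the active feature belongs to, not which member it is.
    Assumes all features are equally likely (an illustrative assumption)."""
    n_features = sum(len(g) for g in groups)
    # A feature in a group of size k is predicted with probability 1/k,
    # so it contributes -log(1/k) = log(k) to the loss.
    return sum(len(g) / n_features * math.log(len(g)) for g in groups)

# Monosemantic neuron: on for f1 only, off for the other three features.
mono = best_cross_entropy([["f1"], ["f2", "f3", "f4"]])

# Polysemantic neuron: on for f1 and f2, off for f3 and f4.
poly = best_cross_entropy([["f1", "f2"], ["f3", "f4"]])

print(f"monosemantic: {mono:.3f}")  # 0.824
print(f"polysemantic: {poly:.3f}")  # 0.693
```

Under these assumptions the even two-versus-two split carries more information about the active feature than the one-versus-three split, so cross-entropy favors the polysemantic neuron even though no two features are ever active together.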

Frameworks (1)

framework

SoLU Activation Function
extends
Prior Anthropic approach to increasing neuron monosemanticity via activation function design; found to make some neurons more interpretable at the cost of others
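
For orientation, SoLU is usually written as the pre-activation vector multiplied elementwise by its own softmax. The sketch below is a minimal pure-Python illustration of that formula only. It omits the layer normalization that typically follows the activation in practice, and the input values are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def solu(xs):
    """SoLU(x) = x * softmax(x), applied elementwise across the layer."""
    return [x * p for x, p in zip(xs, softmax(xs))]

# Illustrative pre-activations for one layer (hypothetical values).
pre = [4.0, 1.0, 0.5, -1.0]
post = solu(pre)
print(post)
```

The largest pre-activation passes through nearly unchanged while the smaller ones are pushed toward zero, which is the sparsifying pressure the activation is designed to create.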

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that polysemantic neurons may be resolvable by unfolding networks or training to avoid polysemanticity. · hypothesis · 0.807
Forward-looking proposal for how the polysemanticity challenge to circuits research might be overcome
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.791
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.770
Critique of activation-based interpretability methods.
All neuronal processing and action selection minimize variational free energy, unifying perception, action, and learning. · claim · 0.770
Fundamental assertion: single imperative (free energy minimization) explains diverse cognitive and neural phenomena.
No established method for resolving polysemantic neurons into pure features at scale · question · 0.769
Identified gap linking polysemanticity challenge to disentangled representations literature
Polysemantic neurons are a major challenge for the circuits agenda, because N meanings in one neuron times M in another creates NxM effective connections that cannot be considered individually. · claim · 0.768
Precise characterization of why polysemanticity poses a combinatorial obstacle to circuit analysis
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation/deployment based on subtle cues · claim · 0.761
Key limitation acknowledged by authors.
Associative learning criterion can occur in gene regulatory networks and non-neural morphogenetic agents · hypothesis · 0.755