question

active

question:no-established-method-for-resolving-polysemantic-neurons-into-pure-features-at-scale

No established method for resolving polysemantic neurons into pure features at scale

Identified gap linking polysemanticity challenge to disentangled representations literature

Source paper

extracted_from

Zoom In: An Introduction to Circuits

(2020) · Chris Olah · Nick Cammarata · Ludwig Schubert · Gabriel Goh +2

Neighborhood — ranked by edge-count

Papers (1)

paper

Zoom In: An Introduction to Circuits
associated_with

Claims (1)

claim

Features are connected by weights forming circuits, and these circuits can be rigorously studied and understood as meaningful algorithms.
gates
Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that polysemantic neurons may be resolvable by unfolding networks or training to avoid polysemanticity. · hypothesis · 0.827
Forward-looking proposal for how the polysemanticity challenge to circuits research might be overcome
Polysemantic neurons are a major challenge for the circuits agenda, because N meanings in one neuron times M in another creates N×M effective connections that cannot be considered individually. · claim · 0.800
Precise characterization of why polysemanticity poses a combinatorial obstacle to circuit analysis
Models with 1-hot activation sparsity still have polysemantic neurons; a single neuron trained on 4 mutually exclusive features prefers a polysemantic representation, with loss ~0.7 vs 0.8. · finding · 0.783
Counter-example disproving that architectural sparsity alone can prevent polysemanticity
Polysemantic Neuron · concept · 0.780
A neuron that responds to multiple unrelated inputs, posing a major challenge for circuit-level interpretation
Superposition is in some sense deliberate: the model converts pure neurons into polysemantic neurons to store more features in fewer neurons. · claim · 0.769
Interpretation of the cars-in-superposition circuit finding as an intentional representational strategy
Training models with sparse activations cannot fully prevent polysemanticity, because cross-entropy loss creates incentives for polysemantic neurons even without superposition. · claim · 0.769
Author's conclusion after extensive investigation of architectural approaches to monosemanticity
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence. · claim · 0.768
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Smolensky (1986) proposes that viewing a neural representation in a basis that is not aligned with individual neurons can reveal the interpretable distributed structure of that representation. · quote · 0.742
Load-bearing theoretical claim providing the conceptual foundation for DAS.
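
The combinatorial obstacle named in the similarity list above (N meanings in one neuron times M in another) can be made concrete with a small sketch. The neuron contents and feature names below are hypothetical, chosen only for illustration; they do not come from the source paper. The sketch enumerates every pair of meanings that a single weight between two polysemantic neurons would have to account for.

```python
from itertools import product


def effective_connections(meanings_a, meanings_b):
    """Return every (meaning_a, meaning_b) pair mixed by one weight.

    A single weight between two polysemantic neurons stands in for
    len(meanings_a) * len(meanings_b) conceptual connections.
    """
    return list(product(meanings_a, meanings_b))


# Hypothetical polysemantic neurons (illustrative feature names only).
neuron_a = ["cat face", "car front", "cat leg"]  # N = 3 meanings
neuron_b = ["dog snout", "wheel"]                # M = 2 meanings

pairs = effective_connections(neuron_a, neuron_b)
print(len(pairs))  # N * M = 6
```

With monosemantic neurons (N = M = 1) the same weight would correspond to exactly one connection, which is why resolving polysemantic neurons into pure features would make circuit analysis easier.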