framework
active
framework:superposition-hypothesis
Superposition Hypothesis
Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
Neighborhood — ranked by edge-count
Papers (1)
Concepts (6)
- Linear representation · supports · The idea that features are encoded as directions in activation space.
- Polysemanticity · associated_with · Neurons that respond to multiple unrelated concepts, limiting interpretability.
- Feature Sparsity · supports · Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work.
- Memorization in Superposition · associated_with · Specific phrases or sequences memorized via binary features in superposition, enabling narrow pattern matching despite few neurons.
- Noisy Simulation of Sparse Networks · associated_with · Mechanism by which superposition works: small neural networks exploit sparsity to approximately simulate much larger sparse networks.
- Overcomplete Basis · associated_with · A set of feature directions that is larger than the dimensionality of the space, enabling superposition.
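The concept entries above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not a method taken from any entry in this graph: it assumes random unit vectors as feature directions, and the function names and the sizes (512 features, 128 dimensions) are hypothetical choices made for illustration. It shows an overcomplete set of directions, measures how far they are from orthogonal, and reads a sparse feature vector back out with some interference noise.

```python
import numpy as np

def random_feature_directions(n_features, n_dims, seed=0):
    """Draw n_features random unit vectors in n_dims dimensions (assumed setup)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, n_dims))
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def max_interference(W):
    """Largest absolute dot product between two distinct feature directions."""
    gram = W @ W.T
    np.fill_diagonal(gram, 0.0)  # ignore each direction's overlap with itself
    return float(np.abs(gram).max())

def encode(x, W):
    """Sum the directions of the active features into one activation vector."""
    return x @ W

def decode(h, W):
    """Read every feature back by projecting onto its direction."""
    return h @ W.T

# Overcomplete basis: 512 feature directions in a 128-dimensional space.
W = random_feature_directions(n_features=512, n_dims=128)

# Sparse input: only 2 of the 512 features are active.
x = np.zeros(512)
x[[3, 200]] = 1.0

x_hat = decode(encode(x, W), W)
top_two = sorted(int(i) for i in np.argsort(x_hat)[-2:])
print("max interference:", max_interference(W))
print("two largest readouts:", top_two)
```

Because only a few features are active at once, the readout noise on the inactive features stays small relative to the active ones. Activating many features at the same time would make the interference terms add up, which is why the Feature Sparsity entry is listed as supporting the hypothesis.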
Claims (2)
- Motivates the introduction of mass-mean probing as an alternative to LR
- Authors' overall conclusion from number of interpretable features, activation-level correspondence to intensity, sensible logit weights, and interference weights
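The first claim above names mass-mean probing without defining it. As a hedged sketch: the version below assumes the common formulation in which the probe direction is the difference between the mean activations of the two classes, and it assumes "LR" refers to logistic regression. The function names, the midpoint threshold, and the toy data are hypothetical and are not taken from the source.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference between the mean activation of positive and negative examples."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

def mass_mean_predict(acts, direction, midpoint):
    """Classify by the sign of the projection onto the probe direction.

    Thresholding at the midpoint between class means is an assumption
    made for this sketch.
    """
    acts = np.asarray(acts, dtype=float)
    return (acts - midpoint) @ direction > 0

# Toy activations: two positive and two negative examples in 2 dimensions.
acts = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
labels = np.array([True, True, False, False])

direction = mass_mean_direction(acts, labels)
midpoint = (acts[labels].mean(axis=0) + acts[~labels].mean(axis=0)) / 2
preds = mass_mean_predict(acts, direction, midpoint)
```

Unlike logistic regression, this probe needs no iterative optimization: the direction comes directly from two class means.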
Frameworks (2)
- Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
- Disentanglement · contradicts · Related research agenda seeking representations that separate conceptually distinct factors; contrasted with the superposition approach.
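The first framework entry above describes a one-hidden-layer MLP trained with an L1 sparsity penalty to decompose activations into an overcomplete dictionary. The sketch below shows only the forward pass and the loss for such a model, as one plausible reading of that description. The entry's title is missing from this page, so the parameter names, initialization scale, layer sizes, and `l1_coeff` value are all assumptions, and no training loop is included.

```python
import numpy as np

def init_sae(d_model, d_dict, seed=0):
    """Initialize encoder/decoder weights (hypothetical names and scale)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.standard_normal((d_model, d_dict)) * 0.1,
        "b_enc": np.zeros(d_dict),
        "W_dec": rng.standard_normal((d_dict, d_model)) * 0.1,
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, acts):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, acts @ params["W_enc"] + params["b_enc"])  # ReLU
    recon = features @ params["W_dec"] + params["b_dec"]
    return features, recon

def sae_loss(params, acts, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on the feature activations."""
    features, recon = sae_forward(params, acts)
    mse = np.mean(np.sum((recon - acts) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(features), axis=1))
    return float(mse + l1_coeff * l1)

# Overcomplete dictionary: 64 features for 16-dimensional activations.
rng = np.random.default_rng(1)
acts = rng.standard_normal((4, 16))
params = init_sae(d_model=16, d_dict=64)
features, recon = sae_forward(params, acts)
loss = sae_loss(params, acts)
```

The L1 term pushes most feature activations toward zero, which matches the sparsity assumption that superposition relies on; the dictionary size being larger than the activation dimension is what makes it overcomplete.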
Findings (1)
- Shows superposition enables many more features than neurons
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Phenomenon where models represent more features than dimensions via almost-orthogonal directions.
- Representation of features spread across multiple layers, complicating dictionary learning.
- Theoretical model of how neural networks encode more features than dimensions, informing linear representation work.
- The state in which a dialogue agent maintains multiple possible characters simultaneously, refined as the conversation proceeds
- The conjecture that consciousness does not result from the organized mind but creates and maintains complex models of reality; forms at the beginning of mental development
- The more nuanced second metaphor: LLM as simulator maintaining a superposition of possible simulacra across a multiverse of characters
- The hypothesis that analogous features and circuits reliably form across different neural network models and tasks
- Prior model of superposition where features are discrete 1D objects repelling each other roughly evenly; paper argues this is incomplete