finding

active

finding:512-neuron-mlp-continues-to-yield-new-features-as-autoencoder-scales-to-131-072-features-256-expansion

512-neuron MLP continues to yield new features as autoencoder scales to 131,072 features (256× expansion)

Shows that superposition lets a layer represent many more features than it has neurons

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Superposition Hypothesis
supports
Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
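
A minimal sketch of why a 512-dimensional layer could hold far more than 512 feature directions. It is an illustration under stated assumptions, not the source paper's method: the random unit vectors and the feature count of 2,048 are hypothetical stand-ins, chosen only to keep the demo small. The point is that random directions in 512 dimensions overlap only slightly, so many of them can coexist with limited interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 512    # layer width, taken from the finding above
n_features = 2048  # hypothetical feature count (assumption, kept small for speed)

# Assumption: random unit vectors stand in for learned feature directions.
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Pairwise cosine similarity between directions; zero the diagonal
# so each direction's similarity with itself is ignored.
cosines = directions @ directions.T
np.fill_diagonal(cosines, 0.0)

max_interference = float(np.abs(cosines).max())
mean_interference = float(np.abs(cosines).mean())

print(f"{n_features} directions in {n_neurons} dimensions")
print(f"largest |cosine| between two directions: {max_interference:.3f}")
print(f"mean |cosine| between two directions:    {mean_interference:.3f}")
```

Learned features need not be random, so the numbers printed here say nothing about the actual model; they only show that near-orthogonality is cheap in high dimensions.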

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A/5 autoencoder (131,072 features) recovers 94.5% of MLP log-likelihood loss reduction · finding · 0.787
Shows that loss recovery increases with autoencoder size
Approximately 0.2% of MLP neurons at layer 18 (~28 neurons) are sufficient to account for the generic addition computation across all cyclic tasks · claim · 0.785
Claim about the sparsity and sufficiency of the identified neuron set
A/1 autoencoder recovers 79% of MLP log-likelihood loss reduction with 4,096 features · finding · 0.784
Measures how much of the MLP layer's function is explained by the learned features
A sparse set of 28 MLP neurons at layer 18 (~0.2% of MLP) are reused across all cyclic tasks · finding · 0.769
Quantitative finding identifying the specific neurons responsible for generic addition
The 28 MLP neurons at layer 18 can be partitioned into disjoint clusters, each computing the sum for a Fourier feature with a different period · finding · 0.764
Structural finding showing modular organization within the sparse neuron set
82% of features in the 1M SAE had a maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance · finding · 0.748
SAE features are not simply mirroring individual neurons.
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.740
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥0.1, and its three largest coefficients are negative; the most correlated neuron responds to a mixture of non-English languages · finding · 0.739
Demonstrates that the Arabic feature is not aligned to any single neuron
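
One entry in the list above reports the maximum Pearson correlation between each SAE feature and any MLP neuron. A rough sketch of how such a statistic could be computed from activation matrices follows. The function name `max_neuron_correlation`, the synthetic data, and the choice of the signed (rather than absolute) correlation are all assumptions made for illustration; the source does not specify its procedure.

```python
import numpy as np

def max_neuron_correlation(feature_acts, neuron_acts):
    """For each feature, return its largest Pearson r with any neuron.

    feature_acts: array of shape (n_samples, n_features)
    neuron_acts:  array of shape (n_samples, n_neurons)
    """
    # Center each column, then scale to unit norm so that a dot
    # product between two columns equals their Pearson correlation.
    f = feature_acts - feature_acts.mean(axis=0)
    n = neuron_acts - neuron_acts.mean(axis=0)
    f = f / np.maximum(np.linalg.norm(f, axis=0), 1e-12)
    n = n / np.maximum(np.linalg.norm(n, axis=0), 1e-12)
    corr = f.T @ n  # shape (n_features, n_neurons)
    return corr.max(axis=1)

# Synthetic data (hypothetical): feature 0 copies neuron 3 exactly,
# feature 1 is independent noise.
rng = np.random.default_rng(0)
neuron_acts = rng.standard_normal((1000, 8))
feature_acts = np.column_stack([neuron_acts[:, 3], rng.standard_normal(1000)])

max_corr = max_neuron_correlation(feature_acts, neuron_acts)
fraction_low = float((max_corr <= 0.3).mean())
print(max_corr, fraction_low)
```

With real activations, `fraction_low` would play the role of the reported percentage; swapping `corr.max` for `np.abs(corr).max` would also count strongly anti-correlated neurons as matches.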