finding

active

finding:no-neuron-found-with-hebrew-unicode-block-in-top-dataset-examples-most-correlated-neuron-a-neurons-489-has-correlation-of-only-0-1

No neuron found with Hebrew Unicode block in top dataset examples; most correlated neuron A/neurons/489 has correlation of only 0.1

Hebrew feature is effectively invisible in the neuron basis
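
As a minimal sketch of how a "most correlated neuron" figure like the 0.1 above could be obtained: record the feature's activation on each token, compute its Pearson correlation with every neuron's activation on the same tokens, and keep the maximum. The function names and the toy activations below are hypothetical. The source's exact procedure (dataset, token sampling) is not stated on this page.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length activation lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant activation vector has no defined correlation; treat as 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def most_correlated_neuron(feature_acts, neuron_acts):
    """Return (neuron_id, correlation) for the neuron that best tracks the feature.

    neuron_acts maps a neuron id to its activations on the same tokens
    as feature_acts (hypothetical layout, chosen for illustration).
    """
    best_id, best_r = None, -math.inf
    for neuron_id, acts in neuron_acts.items():
        r = pearson(feature_acts, acts)
        if r > best_r:
            best_id, best_r = neuron_id, r
    return best_id, best_r


# Toy data: six tokens, two neurons (values are made up).
feature_acts = [0.0, 0.0, 3.0, 0.0, 2.0, 0.0]
neuron_acts = {
    "n0": [1.0, 1.0, 1.0, 2.0, 1.0, 1.0],
    "n1": [0.0, 1.0, 3.0, 0.0, 2.0, 1.0],
}
best_id, best_r = most_correlated_neuron(feature_acts, neuron_acts)
```

A maximum as low as 0.1 would mean that no single neuron's activations track the feature, which is the sense in which the feature is "invisible in the neuron basis".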

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
supports
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Most correlated neuron A/neurons/470 has correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, URLs · finding · 0.870
Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥0.1 and three largest coefficients are negative; most correlated neuron responds to mixture of non-English languages · finding · 0.820
Demonstrates that the Arabic feature is not aligned to any single neuron
Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.773
Systematic comparison showing features are substantially more universal than neurons across models
Hebrew feature A/1/416 and B/1/1901 have activation correlation of 0.92 · finding · 0.770
Universality of Hebrew script feature across two transformers
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance · finding · 0.764
SAE features are not simply mirroring individual neurons.
There is a many-to-many mapping between neurons and concepts, meaning multiple high-level causal variables might be encoded in overlapping groups of neurons · claim · 0.762
Fundamental theoretical claim motivating DAS, attributed to Smolensky/Rumelhart/McClelland.
Fourier features with period 10 contribute to base-10 sum computation in the 28-neuron cluster · finding · 0.757
One of the three base-10 Fourier periods identified in the sparse neuron set
Neurons can correspond to interpretable functional roles but interpretations in terms of individual neurons are unlikely to be the most parsimonious · claim · 0.749
Claim from footnote 3, acknowledging neuron-level interpretability while arguing subcomponents are better.
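
The "Related by similarity" listing above is described as entities with cosine ≥ 0.65 and no typed edge. A minimal sketch of that selection rule, assuming each entity has an embedding vector and that typed edges are stored as pairs of entity ids; the identifiers, vectors, and function names are hypothetical and do not describe this page's actual backend:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_candidates(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, best first.

    Entities already linked to the target by a typed edge (in either
    direction) are skipped, so the result holds only candidates for new
    edges or unrecognized duplicates.
    """
    results = []
    for other, vector in embeddings.items():
        if other == target:
            continue
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((other, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (made-up values) for four hypothetical entities.
embeddings = {
    "hebrew-neuron": [1.0, 0.0, 0.2],
    "base64-neuron": [0.9, 0.1, 0.3],
    "sae-claim": [0.8, 0.2, 0.1],
    "unrelated": [0.0, 1.0, 0.0],
}
typed_edges = {("hebrew-neuron", "sae-claim")}  # already linked, e.g. "supports"
candidates = similarity_candidates("hebrew-neuron", embeddings, typed_edges)
```

In this toy setup, the entity with an existing typed edge is excluded even though its embedding is close to the target's, mirroring the "no typed edge" condition.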