finding

active

finding:arabic-feature-a-1-3450-has-27-neurons-with-coefficient-magnitude-0-1-and-three-largest-coefficients-are-negative-most-correlated-neuron-responds-to-mixture-of-non-english-languages

Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥0.1, and its three largest coefficients are negative; the most correlated neuron responds to a mixture of non-English languages

Demonstrates that the Arabic feature is not aligned to any single neuron
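
The kind of check behind this finding can be sketched as follows. This is a minimal illustration, not the source paper's code: the function name `summarize_feature_coefficients`, the toy vector, and its width of 512 are all assumptions made for the example, and the toy values are not the real coefficients of A/1/3450. It counts how many neuron coefficients of one feature direction meet a magnitude threshold and reports whether the largest few are negative.

```python
import numpy as np


def summarize_feature_coefficients(coeffs, threshold=0.1, top_k=3):
    """Summarize how one feature direction spreads over the neuron basis.

    coeffs: 1-D array with one coefficient per neuron (hypothetical input).
    """
    coeffs = np.asarray(coeffs, dtype=float)
    magnitudes = np.abs(coeffs)

    # Number of neurons whose coefficient magnitude meets the threshold.
    n_large = int(np.sum(magnitudes >= threshold))

    # Indices of the top_k coefficients by magnitude, largest first.
    top_idx = np.argsort(-magnitudes)[:top_k]
    top_values = coeffs[top_idx]

    return {
        "n_large": n_large,
        "top_indices": top_idx.tolist(),
        "top_values": top_values.tolist(),
        "top_all_negative": bool(np.all(top_values < 0)),
    }


# Toy vector (made-up values, NOT the real feature): five coefficients at or
# above the threshold, with the three largest in magnitude all negative.
toy = np.zeros(512)
toy[[3, 10, 42]] = [-0.5, -0.4, -0.3]
toy[[7, 99]] = [0.2, 0.15]

summary = summarize_feature_coefficients(toy)
print(summary["n_large"], summary["top_all_negative"])
```

A feature aligned to a single neuron would show one dominant coefficient; many coefficients above the threshold, as reported here, indicate a direction spread across neurons.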

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
supports
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

No neuron found with Hebrew Unicode block in top dataset examples; most correlated neuron A/neurons/489 has correlation of only 0.1 · finding · 0.820
Hebrew feature is effectively invisible in the neuron basis
Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.819
Demonstrates universality of the Arabic script feature across two independently trained transformers
Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.818
Systematic comparison showing features are substantially more universal than neurons across models
Most correlated neuron A/neurons/470 has correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, URLs · finding · 0.813
Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. · finding · 0.776
SAE features are not simply mirroring individual neurons.
Fourier features with period 10 contribute to base-10 sum computation in the 28-neuron cluster · finding · 0.766
One of the three base-10 Fourier periods identified in the sparse neuron set
The Arabic script feature would be effectively invisible if we only analyzed the model in terms of neurons. · quote · 0.764
Summarizes key finding that monosemantic features cannot be discovered by neuron-level analysis
Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.755
Explanation for why dictionary learning can recover many more features than dimensions.