finding

active

finding:most-correlated-neuron-a-neurons-470-has-correlation-of-only-0-18-with-base64-feature-a-1-2357-and-responds-to-code-html-labels-urls

The most correlated neuron, A/neurons/470, has a correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, and URLs

Shows that the base64 feature is monosemantic as a learned feature, while the neuron that best matches it is polysemantic
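
A minimal sketch of how a "most correlated neuron" could be identified, assuming per-token activations are available for the feature and for each neuron. The function names, the use of signed Pearson correlation, and the toy activations are assumptions made for illustration; they are not the source paper's code or data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def most_correlated_neuron(feature_acts, neuron_acts):
    """Return (neuron_name, r) for the neuron whose activations correlate most with the feature."""
    return max(
        ((name, pearson(feature_acts, acts)) for name, acts in neuron_acts.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical activations over six tokens (made-up numbers).
feature = [0.0, 0.0, 1.0, 0.9, 0.0, 0.0]  # fires on only two tokens
neurons = {
    "neuron_a": [0.8, 0.1, 0.9, 0.2, 0.7, 0.6],  # also active in unrelated contexts
    "neuron_b": [0.5, 0.5, 0.4, 0.5, 0.5, 0.6],
}
best_name, best_r = most_correlated_neuron(feature, neurons)
print(best_name, round(best_r, 3))
```

In this toy setup even the best-matching neuron has a low correlation with the feature, which is the pattern the finding describes: no single neuron tracks the feature closely.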

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

No neuron found with Hebrew Unicode block in top dataset examples; most correlated neuron A/neurons/489 has correlation of only 0.1 · finding · 0.870
Hebrew feature is effectively invisible in the neuron basis
Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.848
Systematic comparison showing features are substantially more universal than neurons across models
Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥0.1, and its three largest coefficients are negative; the most correlated neuron responds to a mixture of non-English languages · finding · 0.813
Demonstrates that the Arabic feature is not aligned to any single neuron
Base64 feature A/1/2357 and B/1/2165 have activation correlation of 0.85 · finding · 0.812
Universality of base64 feature across two transformers
Claude achieves significantly higher Spearman correlation predicting feature activations vs neuron activations · finding · 0.802
Automated interpretability analysis of activations confirms features are more interpretable than neurons
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance · finding · 0.785
SAE features are not simply mirroring individual neurons.
DNA feature A/1/2937 and B/1/3680 have activation correlation of 0.92 · finding · 0.782
Universality of DNA feature across two transformer models with different random seeds
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.768
Demonstrates that activation similarity can diverge from logit weight similarity due to interference
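
Several of the related findings above reduce a set of per-unit best-match correlations to a single number, such as a median or the share of units at or below a threshold. A minimal sketch of that summary step, assuming the best-match correlations have already been computed. The function name `summarize_best_matches` and the sample values are hypothetical and are not data from any of the listed findings.

```python
from statistics import median


def summarize_best_matches(best_corrs, threshold=0.3):
    """Return (median best-match correlation, fraction of units with correlation <= threshold)."""
    if not best_corrs:
        raise ValueError("best_corrs must be non-empty")
    at_or_below = sum(1 for r in best_corrs if r <= threshold)
    return median(best_corrs), at_or_below / len(best_corrs)


# Hypothetical best-match correlations for five units (made-up numbers).
best = [0.12, 0.18, 0.25, 0.41, 0.72]
med, frac = summarize_best_matches(best)
print(med, frac)
```

The median describes the typical best match, while the thresholded fraction describes how many units have no close match at all; the two can tell different stories about the same set of correlations.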