finding

active

finding:features-in-a-1-have-median-activation-correlation-of-0-72-with-most-similar-feature-in-b-1-neurons-have-median-0-46

Features in A/1 have a median activation correlation of 0.72 with their most similar feature in B/1; neurons have a median of 0.46

Systematic comparison showing features are substantially more universal than neurons across models
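The statistic in this finding can be sketched as follows: for every unit (feature or neuron) in model A, take its highest activation correlation with any unit in model B over the same tokens, then report the median of those best-match values. This is a minimal sketch, not the source paper's code. It assumes Pearson correlation and dense activation matrices of shape (tokens, units); the function names `max_correlation_per_unit` and `median_best_match` are hypothetical.

```python
import numpy as np


def max_correlation_per_unit(acts_a, acts_b):
    """For each column of acts_a, return its highest Pearson correlation
    with any column of acts_b.

    acts_a: array of shape (n_tokens, n_units_a)
    acts_b: array of shape (n_tokens, n_units_b), recorded on the same tokens
    """
    # Centre each unit's activations so dot products become covariances.
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)

    a_norm = np.linalg.norm(a, axis=0)
    b_norm = np.linalg.norm(b, axis=0)
    # Assumption: a unit that never varies gets correlation 0 with everything,
    # which dividing by infinity achieves without producing NaNs.
    a_norm[a_norm == 0] = np.inf
    b_norm[b_norm == 0] = np.inf

    # Entry (i, j) is the Pearson correlation between unit i of A and unit j of B.
    corr = (a / a_norm).T @ (b / b_norm)
    return corr.max(axis=1)


def median_best_match(acts_a, acts_b):
    """Median over units in A of the best-match correlation in B."""
    return float(np.median(max_correlation_per_unit(acts_a, acts_b)))
```

Calling `median_best_match` once on the two models' feature activations and once on their neuron activations would give the two numbers being compared here (0.72 and 0.46 in the finding). The full correlation matrix is materialized in memory, so for large dictionaries or many tokens a chunked or streaming variant would be needed.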

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Feature universality across independently trained models suggests features have some existence beyond individual models
associated_with · supports
Authors take agnostic position on ontological status but universality evidence pushes toward features being real

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Most correlated neuron A/neurons/470 has a correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, and URLs · finding · 0.848
Shows that the closest neuron is polysemantic, whereas the learned base64 feature is monosemantic
DNA feature A/1/2937 and B/1/3680 have activation correlation of 0.92 · finding · 0.839
Universality of DNA feature across two transformer models with different random seeds
Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥0.1, and its three largest coefficients are negative; the most correlated neuron responds to a mixture of non-English languages · finding · 0.818
Demonstrates that the Arabic feature is not aligned to any single neuron
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance · finding · 0.807
SAE features are not simply mirroring individual neurons.
Base64 feature A/1/2357 and B/1/2165 have activation correlation of 0.85 · finding · 0.805
Universality of base64 feature across two transformers
Hebrew feature A/1/416 and B/1/1901 have activation correlation of 0.92 · finding · 0.803
Universality of Hebrew script feature across two transformers
Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.803
Demonstrates universality of the Arabic script feature across two independently trained transformers
Claude achieves significantly higher Spearman correlation when predicting feature activations than when predicting neuron activations · finding · 0.803
Automated interpretability analysis of activations confirms features are more interpretable than neurons