DNA features A/1/2937 and B/1/3680 have an activation correlation of 0.92

Universality of DNA feature across two transformer models with different random seeds

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Feature universality across independently trained models suggests features have some existence beyond individual models
supports
Authors take agnostic position on ontological status but universality evidence pushes toward features being real

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Base64 feature A/1/2357 and B/1/2165 have activation correlation of 0.85 · finding · 0.865
Universality of base64 feature across two transformers
Hebrew feature A/1/416 and B/1/1901 have activation correlation of 0.92 · finding · 0.865
Universality of Hebrew script feature across two transformers
Binarized DNA proxy has Pearson correlation of 0.80 with A/1/2937 feature activations · finding · 0.857
Demonstrates specificity and sensitivity of DNA feature
Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.847
Demonstrates universality of the Arabic script feature across two independently trained transformers
Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.839
Systematic comparison showing features are substantially more universal than neurons across models
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.836
Demonstrates that activation similarity can diverge from logit weight similarity due to interference
Most correlated neuron A/neurons/470 has correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, URLs · finding · 0.782
Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
Single base64 feature A/0/45 splits into three distinct features in A/1: letter-specific, digit-specific, and ASCII-encoding-specific · finding · 0.751
Concrete example of feature splitting revealing unexpected model structure
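
Several findings above report an activation correlation between a feature in one model and its closest counterpart in another, plus a median over all features. The sketch below shows one plausible way to compute such numbers. It assumes both models' feature activations were recorded on the same token sequence and stored as dense arrays. The function names, array shapes, and synthetic data are illustrative assumptions and are not taken from the source paper.

```python
import numpy as np


def correlation_matrix(acts_a, acts_b):
    """Pearson correlation between every column of acts_a and every column of acts_b.

    acts_a: (n_tokens, n_features_a) activations from model A (assumed layout)
    acts_b: (n_tokens, n_features_b) activations from model B on the same tokens
    """
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    a_norm = np.linalg.norm(a, axis=0)
    b_norm = np.linalg.norm(b, axis=0)
    # Features that never vary get correlation 0 instead of a division by zero.
    a_norm = np.where(a_norm == 0, np.inf, a_norm)
    b_norm = np.where(b_norm == 0, np.inf, b_norm)
    return (a / a_norm).T @ (b / b_norm)


def best_match_summary(acts_a, acts_b):
    """For each A feature, find the most correlated B feature and the median of those maxima."""
    corr = correlation_matrix(acts_a, acts_b)
    best_idx = corr.argmax(axis=1)
    best_corr = corr.max(axis=1)
    return best_idx, best_corr, float(np.median(best_corr))


# Synthetic stand-in data: both "models" see the same latent signals,
# but model B stores them in a different order and with a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 5))
perm = np.array([3, 0, 4, 1, 2])  # B column j carries latent signal perm[j]
acts_a = np.maximum(latent, 0.0)
acts_b = np.maximum(latent[:, perm] + 0.1 * rng.normal(size=(2000, 5)), 0.0)

best_idx, best_corr, median_corr = best_match_summary(acts_a, acts_b)
print("best B match for each A feature:", best_idx)
print("median best-match correlation:", round(median_corr, 3))
```

The same correlation_matrix helper could be applied to neuron activations instead of feature activations to obtain a features-versus-neurons comparison, under the same assumption that both arrays are aligned token by token.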