finding

active

finding:base64-feature-a-1-2357-and-b-1-2165-have-activation-correlation-of-0-85

Base64 feature A/1/2357 and B/1/2165 have activation correlation of 0.85

Universality of base64 feature across two transformers
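As a rough illustration of what an activation correlation between two features means, the sketch below computes a Pearson correlation between two per-token activation vectors. It assumes both features were evaluated on the same token sequence. The choice of Pearson as the statistic and the function name `activation_correlation` are illustrative assumptions, not details confirmed by this entry.

```python
import numpy as np


def activation_correlation(acts_a, acts_b):
    """Pearson correlation between two per-token activation vectors.

    acts_a, acts_b: 1-D sequences of the same length, where position i holds
    each feature's activation on the same token i (assumed setup).
    """
    a = np.asarray(acts_a, dtype=float)
    b = np.asarray(acts_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two 1-D vectors of equal length")

    # Center both vectors, then normalize the dot product.
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    if denom == 0.0:
        # A feature that never varies has no defined correlation.
        return float("nan")
    return float((a_centered * b_centered).sum() / denom)
```

A value near 1 means the two features fire on the same tokens with proportional strength. A value such as 0.85 indicates strong but imperfect agreement.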

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Feature universality across independently trained models suggests features have some existence beyond individual models
supports
Authors take agnostic position on ontological status but universality evidence pushes toward features being real

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DNA feature A/1/2937 and B/1/3680 have activation correlation of 0.92 · finding · 0.865
Universality of DNA feature across two transformer models with different random seeds
Hebrew feature A/1/416 and B/1/1901 have activation correlation of 0.92 · finding · 0.859
Universality of Hebrew script feature across two transformers
Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.854
Demonstrates universality of the Arabic script feature across two independently trained transformers
Activating the base64 feature A/1/2357 causes the model to generate base64 text · finding · 0.821
Causal validation of base64 feature function via pinned feature sampling
Most correlated neuron A/neurons/470 has correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, URLs · finding · 0.812
Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.806
Demonstrates that activation similarity can diverge from logit weight similarity due to interference
Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.805
Systematic comparison showing features are substantially more universal than neurons across models
Binarized DNA proxy has Pearson correlation of 0.80 with A/1/2937 feature activations · finding · 0.787
Demonstrates specificity and sensitivity of DNA feature
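
Several of the related findings compare each feature in one model with its most similar feature in the other and then summarize with a median. A minimal sketch of that kind of comparison follows, assuming two activation matrices recorded on the same tokens (rows are tokens, columns are features). The function name, the matrix layout, and the use of Pearson correlation are assumptions for illustration only.

```python
import numpy as np


def best_match_correlations(acts_a, acts_b):
    """For each column (feature) of acts_a, find the most correlated column of acts_b.

    acts_a: array of shape (n_tokens, n_features_a)
    acts_b: array of shape (n_tokens, n_features_b)
    Returns (best_corr, best_index), each of length n_features_a.
    """
    a = np.asarray(acts_a, dtype=float)
    b = np.asarray(acts_b, dtype=float)
    if a.ndim != 2 or b.ndim != 2 or a.shape[0] != b.shape[0]:
        raise ValueError("expected 2-D arrays with the same number of rows")

    # Center each feature over tokens.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)

    # Per-feature norms; constant features get an infinite norm so that
    # their correlation with everything comes out as 0 instead of NaN.
    a_norm = np.linalg.norm(a, axis=0)
    b_norm = np.linalg.norm(b, axis=0)
    a_norm[a_norm == 0.0] = np.inf
    b_norm[b_norm == 0.0] = np.inf

    # Full Pearson correlation matrix, shape (n_features_a, n_features_b).
    corr = (a / a_norm).T @ (b / b_norm)
    return corr.max(axis=1), corr.argmax(axis=1)


# Toy data: model B holds rescaled or reordered copies of model A's features.
toy_a = np.array([[0, 1], [1, 0], [2, 1], [3, 0]])
toy_b = np.array([[1, 0, 1], [0, 2, 1], [1, 4, 0], [0, 6, 0]])
best_corr, best_index = best_match_correlations(toy_a, toy_b)
median_best = float(np.median(best_corr))
```

The median of `best_corr` is the kind of summary statistic quoted for features versus neurons. On real data the full correlation matrix can be large, so it would likely need to be computed in chunks.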