finding

active

finding:feature-pair-a-1-3949-and-b-1-3321-have-activation-correlation-0-98-but-negative-logit-weight-correlation-firing-on-plosone-journal-abbreviations

Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations

Demonstrates that activation similarity can diverge from logit weight similarity due to interference
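The distinction between the two similarity measures can be illustrated with a minimal sketch on synthetic data. This is not the source paper's pipeline: the arrays, sizes, noise levels, and the `pearson` helper are all hypothetical, chosen only to show that two features can fire on nearly the same tokens while their logit weight vectors point in opposite directions.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(0)
n_tokens, vocab_size = 10_000, 500  # assumed sizes, for illustration only

# Activations: both hypothetical features respond to the same sparse signal,
# so their per-token activations are almost perfectly correlated.
shared = np.maximum(rng.normal(size=n_tokens) - 1.0, 0.0)
acts_a = shared + 0.01 * rng.random(n_tokens)
acts_b = 0.8 * shared + 0.01 * rng.random(n_tokens)

# Logit weights: each feature's effect on the output vocabulary.
# Here they are built to be roughly opposite, an assumed stand-in
# for the divergence the finding describes.
base = rng.normal(size=vocab_size)
logit_w_a = base + 0.3 * rng.normal(size=vocab_size)
logit_w_b = -base + 0.3 * rng.normal(size=vocab_size)

activation_corr = pearson(acts_a, acts_b)           # computed over tokens
logit_weight_corr = pearson(logit_w_a, logit_w_b)   # computed over vocabulary

print(f"activation correlation:   {activation_corr:.3f}")
print(f"logit weight correlation: {logit_weight_corr:.3f}")
```

The two correlations are taken over different axes (tokens versus vocabulary entries), which is why a high value for one places no constraint on the other.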

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DNA feature A/1/2937 and B/1/3680 have activation correlation of 0.92 · finding · 0.836
Universality of DNA feature across two transformer models with different random seeds
Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.811
Demonstrates universality of the Arabic script feature across two independently trained transformers
Base64 feature A/1/2357 and B/1/2165 have activation correlation of 0.85 · finding · 0.806
Universality of base64 feature across two transformers
Hebrew feature A/1/416 and B/1/1901 have activation correlation of 0.92 · finding · 0.800
Universality of Hebrew script feature across two transformers
Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.796
Systematic comparison showing features are substantially more universal than neurons across models
Binarized DNA proxy has Pearson correlation of 0.80 with A/1/2937 feature activations · finding · 0.778
Demonstrates specificity and sensitivity of DNA feature
Most correlated neuron A/neurons/470 has correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, URLs · finding · 0.768
Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.762
Key quantitative evidence that detection signal is identical to global logit shift confound