finding
active
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations
Demonstrates that activation similarity can diverge from logit weight similarity due to interference
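A minimal sketch of how the two similarity measures in this finding can disagree. It assumes each feature is summarized by a vector of per-token activations and a vector of output logit weights. The helper `pearson` and every number below are hypothetical toy values chosen for illustration, not data from the source paper, and the sketch does not model the interference mechanism itself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic per-token activations: both features fire on the same tokens.
acts_a = [0.0, 0.0, 2.0, 3.0, 0.0, 1.0]
acts_b = [0.0, 0.1, 2.1, 2.9, 0.0, 1.2]

# Synthetic logit weights over a tiny vocabulary: roughly opposite directions.
logit_w_a = [1.0, -0.5, 0.2, -1.0, 0.3]
logit_w_b = [-0.8, 0.6, -0.1, 0.9, -0.4]

activation_corr = pearson(acts_a, acts_b)
logit_weight_corr = pearson(logit_w_a, logit_w_b)

print(f"activation correlation:   {activation_corr:.3f}")
print(f"logit weight correlation: {logit_weight_corr:.3f}")
```

In this toy setup the activation correlation is high while the logit weight correlation is negative, which is the qualitative pattern the finding reports for the A/1/3949 and B/1/3321 pair.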
Source paper
extracted_from(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Universality of DNA feature across two transformer models with different random seeds
- Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.811 · Demonstrates universality of the Arabic script feature across two independently trained transformers
- Universality of base64 feature across two transformers
- Universality of Hebrew script feature across two transformers
- Systematic comparison showing features are substantially more universal than neurons across models
- Demonstrates specificity and sensitivity of DNA feature
- Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
- Key quantitative evidence that detection signal is identical to global logit shift confound