finding
active

Feature pair A/1/3949 and B/1/3321 has an activation correlation of 0.98 but a negative logit weight correlation; both features fire on PLOS ONE journal abbreviations

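Demonstrates that two features can fire on nearly the same inputs yet have dissimilar logit weights: activation similarity can diverge from logit weight similarity, a divergence attributed to interference.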
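
The two quantities compared in this finding can be sketched on synthetic data. This is only an illustration of how an activation correlation and a logit weight correlation are computed and how they can disagree in sign. The arrays, sizes, and noise levels below are assumptions, not the values for A/1/3949 and B/1/3321, and the sketch does not model the interference mechanism itself.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))


rng = np.random.default_rng(0)
n_tokens, vocab_size = 5000, 1000  # hypothetical sizes

# Hypothetical activations: both features track one shared non-negative
# signal (e.g. "this token is a journal abbreviation") plus small noise.
shared = np.maximum(rng.normal(size=n_tokens), 0.0)
act_a = shared + 0.05 * rng.normal(size=n_tokens)
act_b = shared + 0.05 * rng.normal(size=n_tokens)

# Hypothetical logit weights (one value per vocabulary entry): the two
# features push the output logits in roughly opposite directions.
base = rng.normal(size=vocab_size)
logit_w_a = base + 0.5 * rng.normal(size=vocab_size)
logit_w_b = -base + 0.5 * rng.normal(size=vocab_size)

act_corr = pearson(act_a, act_b)            # similarity of when they fire
logit_corr = pearson(logit_w_a, logit_w_b)  # similarity of output effect

print(f"activation correlation:   {act_corr:.2f}")
print(f"logit weight correlation: {logit_corr:.2f}")
```

With these assumed inputs the activation correlation is close to 1 while the logit weight correlation is negative, which is the qualitative pattern the finding reports.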

Source paper

extracted_from
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
