finding
active
finding:most-correlated-neuron-a-neurons-470-has-correlation-of-only-0-18-with-base64-feature-a-1-2357-and-responds-to-code-html-labels-urls
Most correlated neuron A/neurons/470 has correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, URLs
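Shows that the base64 signal is not isolated in the neuron basis: even the best-matching neuron is polysemantic, firing on code, HTML labels, and URLs, whereas the learned feature is monosemantic.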
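As a rough illustration of the comparison behind this finding, the sketch below ranks neurons by their correlation with a feature's activations and reports the best match. It assumes Pearson correlation over per-token activations, which the record does not specify; the helper names `pearson` and `most_correlated_neuron` are hypothetical, and the data is synthetic, not taken from the source paper. The synthetic feature reads from a direction spread across many neurons, so no single neuron tracks it closely.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def most_correlated_neuron(feature_acts, neuron_acts):
    """Return (index, correlation) of the neuron that best tracks the feature.

    neuron_acts is a list with one activation list per neuron.
    """
    corrs = [pearson(feature_acts, acts) for acts in neuron_acts]
    best = max(range(len(corrs)), key=corrs.__getitem__)
    return best, corrs[best]


# Synthetic stand-in data (an assumption, not the paper's activations).
rng = random.Random(0)
n_tokens, n_neurons = 2000, 128
neuron_acts = [[rng.gauss(0, 1) for _ in range(n_tokens)] for _ in range(n_neurons)]

# Hypothetical feature: ReLU of a unit-norm direction spread over all neurons.
direction = [rng.gauss(0, 1) for _ in range(n_neurons)]
norm = math.sqrt(sum(d * d for d in direction))
direction = [d / norm for d in direction]
feature_acts = [
    max(0.0, sum(direction[j] * neuron_acts[j][t] for j in range(n_neurons)))
    for t in range(n_tokens)
]

best, r = most_correlated_neuron(feature_acts, neuron_acts)
print(f"best neuron: {best}, correlation: {r:.2f}")
```

A low best-match correlation in this kind of check only says that no single neuron mirrors the feature. Establishing that the neuron is polysemantic still requires inspecting what it activates on.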
Source paper
extracted_from(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Hebrew feature is effectively invisible in the neuron basis
- Systematic comparison showing features are substantially more universal than neurons across models
- Demonstrates that the Arabic feature is not aligned to any single neuron
- Universality of base64 feature across two transformers
- Automated interpretability analysis of activations confirms features are more interpretable than neurons
- SAE features are not simply mirroring individual neurons.
- Universality of DNA feature across two transformer models with different random seeds
- Demonstrates that activation similarity can diverge from logit weight similarity due to interference