quote
active
quote:the-arabic-script-feature-would-be-effectively-invisible-if-we-only-analyzed-the-model-in-terms-of-neurons

The Arabic script feature would be effectively invisible if we only analyzed the model in terms of neurons.

Summarizes the key finding that monosemantic features cannot be discovered through neuron-level analysis alone.

Source paper

extracted_from
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
