quote
active
quote:the-arabic-script-feature-would-be-effectively-invisible-if-we-only-analyzed-the-model-in-terms-of-neurons
The Arabic script feature would be effectively invisible if we only analyzed the model in terms of neurons.
Summarizes the key finding that a monosemantic feature can be effectively invisible to neuron-level analysis.
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Points out how deeply internalized layout norms become naturalized and unnoticed.
- Demonstrates that the Arabic feature is not aligned to any single neuron
- Demonstrates activation specificity of the Arabic script sparse autoencoder feature
- Articulates why a one-layer transformer with MLP is the appropriate starting target for mechanistic interpretability
- Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
- Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
- If a text attempts to stand alone, it will almost certainly attract commentary or interference. · hypothesis · 0.727 · Predicts the inevitability of dialogic intrusion upon any statement.
- Core assertion against categorical distinction between genuine and metaphorical cognition; justifies continuous gradualist approach.
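
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the embedding dictionary, and the representation of typed edges as a set of ID pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but lacking a typed edge.

    embeddings:  dict mapping entity ID to vector (hypothetical storage format)
    typed_edges: set of (id_a, id_b) pairs that already have a typed relation
                 (hypothetical; treated as undirected here)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity surfaces in the list only when it is close in embedding space and no typed relation has been recorded yet, which is why unrelated-looking items can appear alongside near-duplicates.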