quote

active

quote:the-arabic-script-feature-would-be-effectively-invisible-if-we-only-analyzed-the-model-in-terms-of-neurons

The Arabic script feature would be effectively invisible if we only analyzed the model in terms of neurons.

Summarizes key finding that monosemantic features cannot be discovered by neuron-level analysis

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The familiarity of the text block renders its conventions almost invisible. · claim · 0.793
Points out how deeply internalized layout norms become naturalized and unnoticed.
Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥0.1 and three largest coefficients are negative; most correlated neuron responds to mixture of non-English languages · finding · 0.764
Demonstrates that the Arabic feature is not aligned to any single neuron
Arabic script feature A/1/3450 fires on 81% Arabic-script tokens when active, with 98% specificity at high activation levels · finding · 0.752
Demonstrates activation specificity of the Arabic script sparse autoencoder feature
In some sense, this is the simplest language model we profoundly don't understand. And so it makes a natural target for our paper. · quote · 0.744
Articulates why a one-layer transformer with MLP is the appropriate starting target for mechanistic interpretability
An interplay between causal abstraction and feature geometry deepens mechanistic understanding of language models · claim · 0.733
Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
Our SAEs' features are more interpretable than neurons. · claim · 0.731
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
If a text attempts to stand alone, it will almost certainly attract commentary or interference. · hypothesis · 0.727
Predicts the inevitability of dialogic intrusion upon any statement.
"We should avoid quotes around mental terms because there is no absolute, binary distinction between it knows and it knows—only a difference in the degree to which a model will be useful." · concept · 0.727
Core assertion against categorical distinction between genuine and metaphorical cognition; justifies continuous gradualist approach.
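
The "Related by similarity" filter described above (cosine ≥ 0.65, no typed edge) could be reproduced roughly as follows. This is a minimal sketch, not the page's actual implementation: the function names, the `embeddings` dictionary, and the `typed_edges` set are hypothetical, and the only values taken from the page are the 0.65 threshold and the rule that entities already linked by a typed relation are excluded.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, highest first.

    embeddings  : dict mapping entity id -> embedding vector (hypothetical store)
    typed_edges : set of frozenset({id_a, id_b}) pairs that already share a
                  typed relation (hypothetical representation)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # never compare an entity with itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Entities returned this way are only candidates: a high score suggests a possible new edge or a duplicate, and each still needs review before a typed relation is added.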