finding
active
finding:arabic-script-feature-a-1-3450-fires-on-81-arabic-script-tokens-when-active-with-98-specificity-at-high-activation-levels
Arabic script feature A/1/3450 fires on 81% Arabic-script tokens when active, with 98% specificity at high activation levels
Demonstrates activation specificity of the Arabic script sparse autoencoder feature
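One way to read the two percentages above is as the share of Arabic-script tokens among the tokens on which the feature is active, measured once over all active tokens and once over only the high-activation ones. The sketch below illustrates that reading on toy data. It is not the paper's procedure: the Unicode-name test for Arabic script, the function names, the threshold, and the token and activation values are all assumptions made for illustration.

```python
import unicodedata


def is_arabic_script(token: str) -> bool:
    # Assumed proxy: a token counts as Arabic-script if any of its
    # characters has a Unicode name beginning with "ARABIC".
    return any(unicodedata.name(ch, "").startswith("ARABIC") for ch in token)


def arabic_fraction(tokens, activations, threshold=0.0):
    # Share of Arabic-script tokens among tokens whose activation
    # strictly exceeds `threshold`; None if no token qualifies.
    selected = [tok for tok, act in zip(tokens, activations) if act > threshold]
    if not selected:
        return None
    return sum(is_arabic_script(tok) for tok in selected) / len(selected)


# Toy data (hypothetical, not taken from the paper).
tokens = ["سلام", "the", "كتاب", "42", "مرحبا"]
activations = [3.0, 0.5, 4.2, 0.0, 0.1]

share_when_active = arabic_fraction(tokens, activations)
share_when_high = arabic_fraction(tokens, activations, threshold=2.0)
```

Raising `threshold` restricts the measurement to stronger activations, which is the sense in which the share can be higher at high activation levels than over all active tokens.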
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.834 · Demonstrates universality of the Arabic script feature across two independently trained transformers
- Pearson correlation of 0.74 between A/1/3450 activation and Arabic script proxy over 40M tokens · finding · 0.800 · Joint measure of sensitivity and specificity for the Arabic script feature
- Summarizes key finding that monosemantic features cannot be discovered by neuron-level analysis
- Universality of Hebrew script feature across two transformers
- Demonstrates prevalence of token-in-context features and feature splitting of common tokens
- Demonstrates that the Arabic feature is not aligned to any single neuron
- Pinning A/1/3450 to maximum observed value causes model to generate Arabic text from numeric prefix context · finding · 0.728 · Causal validation that the Arabic feature has the predicted downstream effect on generation
- Shows a general code error detector beyond simple typo detection.