finding

active

finding:arabic-script-feature-a-1-3450-fires-on-81-arabic-script-tokens-when-active-with-98-specificity-at-high-activation-levels

Arabic script feature A/1/3450 fires on 81% Arabic-script tokens when active, with 98% specificity at high activation levels

Demonstrates activation specificity of the Arabic script sparse autoencoder feature
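As a rough illustration of the kind of measurement this finding summarizes, the sketch below computes the share of Arabic-script tokens among the tokens on which a feature is active, optionally restricted to activations above a threshold. It is not the source paper's procedure: `is_arabic_script`, `arabic_fraction`, the Unicode-range proxy, and the toy data are all hypothetical names and values chosen for illustration.

```python
def is_arabic_script(token: str) -> bool:
    """Hypothetical proxy: True if any character lies in the main Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in token)


def arabic_fraction(tokens, activations, threshold=0.0):
    """Fraction of tokens with activation > threshold that are Arabic script.

    Returns None when no token exceeds the threshold.
    """
    active = [tok for tok, act in zip(tokens, activations) if act > threshold]
    if not active:
        return None
    return sum(is_arabic_script(tok) for tok in active) / len(active)


# Toy data (invented for illustration, not taken from the paper).
tokens = ["ال", "كتاب", "the", "7", "في"]
activations = [4.0, 6.5, 0.3, 0.0, 5.2]

print(arabic_fraction(tokens, activations))                 # 0.75 (3 of 4 active tokens)
print(arabic_fraction(tokens, activations, threshold=1.0))  # 1.0 (3 of 3 high-activation tokens)
```

Raising `threshold` mirrors the distinction in the finding between "when active" and "at high activation levels": on the toy data the weakly activating English token drops out, so the Arabic-script share rises.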

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
supports
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.834
Demonstrates universality of the Arabic script feature across two independently trained transformers
Pearson correlation of 0.74 between A/1/3450 activation and Arabic script proxy over 40M tokens · finding · 0.800
Joint measure of sensitivity and specificity for the Arabic script feature
The Arabic script feature would be effectively invisible if we only analyzed the model in terms of neurons. · quote · 0.752
Summarizes key finding that monosemantic features cannot be discovered by neuron-level analysis
Hebrew feature A/1/416 and B/1/1901 have activation correlation of 0.92 · finding · 0.744
Universality of Hebrew script feature across two transformers
In A/4, over 100 features primarily respond to the token 'the' in different contexts · finding · 0.740
Demonstrates prevalence of token-in-context features and feature splitting of common tokens
Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥ 0.1, and its three largest coefficients are negative; the most correlated neuron responds to a mixture of non-English languages · finding · 0.731
Demonstrates that the Arabic feature is not aligned to any single neuron
Pinning A/1/3450 to its maximum observed value causes the model to generate Arabic text from a numeric prefix context · finding · 0.728
Causal validation that the Arabic feature has the predicted downstream effect on generation
Feature 1M/1013764 activates on diverse code errors (typos in code, array overflow, divide by zero, type mismatch) across Python, C, Scheme, but not on English prose typos. · finding · 0.728
Shows a general code error detector beyond simple typo detection.