finding
active
finding:pinning-a-1-3450-to-maximum-observed-value-causes-model-to-generate-arabic-text-from-numeric-prefix-context
Pinning A/1/3450 to maximum observed value causes model to generate Arabic text from numeric prefix context
Causal validation that the Arabic feature has the predicted downstream effect on generation
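This entry does not say how the pinning was implemented, so the following is only a minimal sketch of one common way to pin a dictionary feature. It assumes a ReLU sparse autoencoder over a single activation vector. Every name here (`pin_feature`, `w_enc`, `b_enc`, `w_dec`, `b_dec`) is hypothetical and not taken from the source. Shifting the activation along the feature's decoder direction is equivalent to decoding with the feature clamped while keeping the autoencoder's reconstruction error unchanged.

```python
def pin_feature(activation, w_enc, b_enc, w_dec, b_dec, feature_idx, pinned_value):
    """Return a copy of `activation` with one dictionary feature pinned.

    Assumed (hypothetical) layout: w_enc[j] and w_dec[j] are the encoder and
    decoder vectors of feature j, each with the same length as `activation`.
    """
    # Subtract the decoder bias before encoding (an assumed convention).
    centered = [a - b for a, b in zip(activation, b_dec)]

    # ReLU encoder: current activation of the feature we want to pin.
    pre_activation = sum(w * c for w, c in zip(w_enc[feature_idx], centered))
    current_value = max(0.0, pre_activation + b_enc[feature_idx])

    # Move along the decoder direction by the gap between the pinned value
    # and the current value; all other features and the residual are untouched.
    delta = pinned_value - current_value
    return [a + delta * w for a, w in zip(activation, w_dec[feature_idx])]


# Toy usage: a 2-feature dictionary with identity encoder and decoder.
identity = [[1.0, 0.0], [0.0, 1.0]]
patched = pin_feature([0.5, -1.0], identity, [0.0, 0.0], identity, [0.0, 0.0],
                      feature_idx=1, pinned_value=3.0)
```

In an experiment like the one described, the patched vector would replace the original activation at every token position during sampling, with `pinned_value` set to the feature's maximum observed activation.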
Source paper
extracted_from (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Causal validation of base64 feature function via pinned feature sampling
- Demonstrates that the Arabic feature is not aligned to any single neuron
- Causal effect: activates generation of security bugs.
- Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.740 · Demonstrates universality of the Arabic script feature across two independently trained transformers
- Demonstrates activation specificity of the Arabic script sparse autoencoder feature
- Feature manipulation alters persona.
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
- Shows a general code error detector beyond simple typo detection.