finding

active

finding:pinning-a-1-3450-to-maximum-observed-value-causes-model-to-generate-arabic-text-from-numeric-prefix-context

Pinning A/1/3450 to maximum observed value causes model to generate Arabic text from numeric prefix context

Causal validation that the Arabic feature has the predicted downstream effect on generation
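
A minimal sketch of what "pinning" a dictionary feature could look like, assuming a sparse-autoencoder setup in which each feature has an encoder row and a decoder direction. The names `encode` and `pin_feature`, the toy two-feature dictionary, and the additive update rule are all illustrative assumptions, not details from the source paper.

```python
from typing import List

Vector = List[float]


def encode(x: Vector, enc_rows: List[Vector], enc_bias: Vector) -> Vector:
    """Hypothetical encoder: f_i = ReLU(w_i . x + b_i) for each feature i."""
    acts = []
    for row, bias in zip(enc_rows, enc_bias):
        pre = sum(w * xi for w, xi in zip(row, x)) + bias
        acts.append(pre if pre > 0.0 else 0.0)
    return acts


def pin_feature(x: Vector, enc_rows: List[Vector], enc_bias: Vector,
                dec_rows: List[Vector], index: int, value: float) -> Vector:
    """Move feature `index` from its current activation to `value`.

    Assumption: the intervention adds (value - current) times the feature's
    decoder direction to the activation vector, leaving everything else as is.
    """
    current = encode(x, enc_rows, enc_bias)[index]
    delta = value - current
    return [xi + delta * d for xi, d in zip(x, dec_rows[index])]


# Toy dictionary with two orthogonal features (illustrative values only).
enc_rows = [[1.0, 0.0], [0.0, 1.0]]
enc_bias = [0.0, 0.0]
dec_rows = [[1.0, 0.0], [0.0, 1.0]]

x = [0.5, 0.0]       # feature 1 is inactive in the original activation
max_observed = 3.0   # stand-in for the feature's maximum observed value
pinned = pin_feature(x, enc_rows, enc_bias, dec_rows, 1, max_observed)
print(pinned)        # [0.5, 3.0]
```

In an actual experiment the pinned activation would be written back into the model at every token position during sampling. That step is omitted here because it depends on the model and hooking framework in use.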

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning
supports
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activating the base64 feature A/1/2357 causes the model to generate base64 text · finding · 0.755
Causal validation of base64 feature function via pinned feature sampling
Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥ 0.1, and its three largest coefficients are negative; the most correlated neuron responds to a mixture of non-English languages · finding · 0.742
Demonstrates that the Arabic feature is not aligned to any single neuron
Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion. · finding · 0.741
Causal effect: activates generation of security bugs.
Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.740
Demonstrates universality of the Arabic script feature across two independently trained transformers
Arabic script feature A/1/3450 fires on 81% Arabic-script tokens when active, with 98% specificity at high activation levels · finding · 0.728
Demonstrates activation specificity of the Arabic script sparse autoencoder feature
Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like. · finding · 0.726
Feature manipulation alters persona.
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.725
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Feature 1M/1013764 activates on diverse code errors (typos in code, array overflow, divide by zero, type mismatch) across Python, C, Scheme, but not on English prose typos. · finding · 0.722
Shows a general code error detector beyond simple typo detection.