finding

active

finding:clamping-golden-gate-bridge-feature-to-10x-max-activation-caused-the-model-to-self-identify-as-the-golden-gate-bridge

Clamping Golden Gate Bridge feature to 10x max activation caused the model to self-identify as the Golden Gate Bridge.

Strong causal evidence that the feature represents the bridge.
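
As an illustration of what "clamping" a feature can look like mechanically, here is a minimal sketch. It assumes a standard ReLU sparse autoencoder over a residual-stream vector, which this entry does not itself specify. All names (`clamp_feature`, `w_enc`, `w_dec`, `max_activation`) are hypothetical and are not taken from the source paper's code. The sketch encodes the vector, sets one feature to a multiple of its maximum observed activation, decodes, and adds back the reconstruction error so that only the clamped feature's contribution changes.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: matrix[i] is row i


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def clamp_feature(
    x: Vector,              # residual-stream vector, length d_model
    w_enc: Matrix,          # encoder weights, n_features rows of length d_model
    b_enc: Vector,          # encoder bias, length n_features
    w_dec: Matrix,          # decoder weights, d_model rows of length n_features
    b_dec: Vector,          # decoder bias, length d_model
    feature_idx: int,       # index of the feature to clamp (hypothetical)
    max_activation: float,  # assumed: max activation seen on a reference dataset
    scale: float = 10.0,    # 10x, as in the finding above
) -> Vector:
    """Return x with one SAE feature clamped to scale * max_activation."""
    # Encode: ReLU(W_enc x + b_enc) gives the feature activations.
    f = [max(0.0, p + b) for p, b in zip(matvec(w_enc, x), b_enc)]

    # Keep the part of x the SAE fails to reconstruct, so the edit
    # changes only the contribution of the clamped feature.
    recon = [r + b for r, b in zip(matvec(w_dec, f), b_dec)]
    error = [xi - ri for xi, ri in zip(x, recon)]

    # Overwrite the chosen feature with the clamped value.
    f_clamped = list(f)
    f_clamped[feature_idx] = scale * max_activation

    # Decode the edited activations and restore the error term.
    steered = [r + b for r, b in zip(matvec(w_dec, f_clamped), b_dec)]
    return [s + e for s, e in zip(steered, error)]
```

Under these assumptions the result equals the original vector plus the clamped feature's decoder direction, scaled by the difference between the clamped and original activation. In a real model the edited vector would replace the residual stream at the relevant layer during the forward pass.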

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

Features can be used to steer large models.
supports
Clamping feature activations causally alters model behavior in interpretable ways.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Clamping transit infrastructure feature to 5x max activation caused the model to mention a bridge in completion. · finding · 0.817
Further causal validation.
For four example features (Golden Gate Bridge, brain sciences, monuments, transit infrastructure), all strong activations (top bucket) received specificity rating 3 from Claude 3 Opus. · finding · 0.799
Validation that top activations are highly specific to interpretation.
Golden Gate Bridge feature [34M/31164353] fires strongly on Wikipedia snippets in Chinese, Japanese, Korean, Russian, Vietnamese, Greek. · finding · 0.785
Demonstrates multilingual generalization of SAE features.
Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like. · finding · 0.774
Feature manipulation alters persona.
Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion. · finding · 0.765
Causal effect: activates generation of security bugs.
Clamping code error feature to high activation causes the model to hallucinate error messages on bug-free code. · finding · 0.755
Causal effect: feature induces perception of bugs.
Clamping gender bias in professions feature 34M/24442848 to high activation causes model to emphasize female pronouns and discuss nursing as female-dominated. · finding · 0.754
Feature steers model toward gender-stereotypical completions.
Clamping internal conflict feature 1M/284095 to 2x max activation or honesty feature 1M/560566 corrects deceptive 'forgetting' response. · finding · 0.741
Feature intervention eliminates untruthful answer.
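
The "related by similarity" list above pairs a cosine threshold with a "no typed edge" filter. A minimal sketch of that selection follows, assuming each entity has an embedding vector and that typed neighbors are available as a set of entity names. The function names, the dictionary layout, and the embedding model are all hypothetical, since the page does not describe how the list is actually computed.

```python
import math
from typing import Dict, List, Set, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_without_edge(
    target: str,
    embeddings: Dict[str, List[float]],  # entity name -> embedding (assumed)
    typed_neighbors: Set[str],           # entities already linked by a typed edge
    threshold: float = 0.65,             # the cutoff shown on this page
) -> List[Tuple[str, float]]:
    """Entities with cosine >= threshold and no typed edge to the target,
    sorted from most to least similar."""
    scored = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already connected by a typed edge.
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Each returned pair is a candidate for a new typed edge or a possible duplicate, matching how the page describes this list.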