finding

active

finding:clamping-dialogue-assistant-feature-1m-80091-to-negative-2x-max-activation-causes-model-to-drop-assistant-persona-and-respond-human-like

Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like.

Feature manipulation alters persona.
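A minimal sketch of what "clamping" a feature could look like, assuming a toy sparse autoencoder (SAE) with a ReLU encoder and linear decoder applied to one activation vector. All names here (`clamp_feature`, `W_enc`, `W_dec`, `feature_idx`, `max_activation`) and the random weights are hypothetical illustrations, not the source paper's code or the real feature 1M/80091. The sketch fixes one feature's activation to a multiple of its maximum (here -2x) and keeps the SAE's reconstruction error unchanged, so only the chosen feature's decoder direction is modified.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, max_activation, scale):
    """Return x with one SAE feature clamped to scale * max_activation."""
    # Encode: ReLU feature activations of the toy sparse autoencoder.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Part of x the SAE does not reconstruct; carried through unchanged.
    error = x - (f @ W_dec + b_dec)
    # Overwrite the chosen feature with a fixed value (negative scale suppresses it).
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = scale * max_activation
    # Decode the edited features and add the untouched error back.
    return f_clamped @ W_dec + b_dec + error

# Toy dimensions and random weights, purely illustrative.
rng = np.random.default_rng(0)
d_model, n_features = 8, 16
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)

feature_idx = 3        # hypothetical stand-in for the dialogue/assistant feature
max_activation = 5.0   # hypothetical maximum activation observed for that feature

# Clamp to negative 2x the maximum activation, as in the finding above.
steered = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, max_activation, -2.0)
```

In this sketch the steered vector differs from the original only along the decoder direction of the clamped feature, which is why the intervention can be read as a targeted edit rather than a general perturbation.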

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

Features can be used to steer large models.
supports
Clamping feature activations causally alters model behavior in interpretable ways.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Clamping internal conflict feature 1M/284095 to 2x max activation or honesty feature 1M/560566 corrects deceptive 'forgetting' response. · finding · 0.807
Feature intervention eliminates untruthful answer.
Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion. · finding · 0.795
Causal effect: activates generation of security bugs.
Clamping secrecy/discreetness feature 1M/268551 to 5x max activation causes model to plan to lie and keep secret while using scratchpad. · finding · 0.794
Shows feature induces deceptive behavior.
Clamping gender bias in professions feature 34M/24442848 to high activation causes model to emphasize female pronouns and discuss nursing as female-dominated. · finding · 0.783
Feature steers model toward gender-stereotypical completions.
Clamping code error feature to large negative activation causes model to output correct result despite bug in code, and in one case rewrite code without bug. · finding · 0.780
Suppressing the feature makes the model ignore bugs.
Clamping code error feature to high activation causes the model to hallucinate error messages on bug-free code. · finding · 0.779
Causal effect: feature induces perception of bugs.
Clamping Golden Gate Bridge feature to 10x max activation caused the model to self-identify as the Golden Gate Bridge. · finding · 0.774
Strong causal evidence that the feature represents the bridge.
Default Assistant activation projects to one extreme of PC1 with minimum distance to edge of 0.03, while projecting to intermediate values (0.27-0.50) on all other PCs. · finding · 0.763
Empirically confirms PC1 measures similarity to the Assistant persona.