finding
active
finding:clamping-dialogue-assistant-feature-1m-80091-to-negative-2x-max-activation-causes-model-to-drop-assistant-persona-and-respond-human-like
Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like.
Feature manipulation alters persona.
Source paper
extracted_from
Neighborhood (ranked by edge-count)
Claims (1)
claim
- Clamping feature activations causally alters model behavior in interpretable ways.
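
As a rough illustration of what clamping a feature to a multiple of its maximum activation can look like, here is a minimal sketch assuming a sparse autoencoder with a ReLU encoder and a linear decoder over a residual-stream vector. The helper names (`encode`, `decode`, `clamp_feature`), the toy weights, and the choice to add back the reconstruction error are all assumptions made for illustration; the source does not specify how the intervention was implemented.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def encode(residual: Vector, encoder: Matrix) -> Vector:
    """ReLU feature activations; encoder[i] is the (hypothetical) encoder row for feature i."""
    return [max(0.0, sum(w * x for w, x in zip(row, residual))) for row in encoder]


def decode(acts: Vector, decoder: Matrix) -> Vector:
    """Weighted sum of decoder directions; decoder[i] is the direction of feature i."""
    dim = len(decoder[0])
    return [sum(a * row[j] for a, row in zip(acts, decoder)) for j in range(dim)]


def clamp_feature(residual: Vector, encoder: Matrix, decoder: Matrix,
                  feature_idx: int, max_activation: float,
                  scale: float = -2.0) -> Vector:
    """Return the residual with one feature forced to scale * max_activation."""
    acts = encode(residual, encoder)
    # Keep whatever the autoencoder fails to reconstruct (an assumption of this sketch).
    error = [r - c for r, c in zip(residual, decode(acts, decoder))]
    clamped = list(acts)
    clamped[feature_idx] = scale * max_activation  # e.g. -2x the observed maximum
    return [s + e for s, e in zip(decode(clamped, decoder), error)]


# Toy example: 3 features over a 2-dimensional residual vector.
encoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
residual = [0.5, -1.0]
steered = clamp_feature(residual, encoder, decoder, feature_idx=0, max_activation=3.0)
```

In this toy setup the intervention shifts the residual only along the decoder direction of the clamped feature, which is what makes the behavioral change attributable to that one feature.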
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Feature intervention eliminates untruthful answer.
- Causal effect: activates generation of security bugs.
- Shows feature induces deceptive behavior.
- Feature steers model toward gender-stereotypical completions.
- Suppressing the feature makes the model ignore bugs.
- Causal effect: feature induces perception of bugs.
- Strong causal evidence that the feature represents the bridge.
- Empirically confirms PC1 measures similarity to the Assistant persona.
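
The selection rule for this list (cosine ≥ 0.65 and no typed edge) can be sketched as follows. This is only an illustration under assumptions: the embedding vectors, entity ids, and the `similar_without_edge` helper are hypothetical, and the source does not say how embeddings or edges are actually stored.

```python
import math
from typing import Dict, FrozenSet, List, Set


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_without_edge(target: str,
                         embeddings: Dict[str, List[float]],
                         typed_edges: Set[FrozenSet[str]],
                         threshold: float = 0.65) -> List[str]:
    """Ids with cosine >= threshold to target and no typed edge, most similar first."""
    scored = []
    for other, vec in embeddings.items():
        if other == target or frozenset((target, other)) in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((score, other))
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical toy data: "d" is similar to "a" but already has a typed edge.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
typed_edges = {frozenset(("a", "d"))}
candidates = similar_without_edge("a", embeddings, typed_edges)
```

Entities returned by such a filter are the ones worth reviewing as possible new edges or unrecognized duplicates.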