finding
active
finding:clamping-scam-email-feature-34m-15460472-causes-model-to-write-scam-email-despite-safety-training
Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training.
The intervention overrides the model's harmlessness training.
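As a rough illustration of what "clamping" a feature means, the sketch below assumes a standard sparse autoencoder (SAE) with a ReLU encoder and a linear decoder. The function name `clamp_feature`, the weight names, and the shapes are hypothetical and are not taken from the source paper. It pins one feature's activation to a chosen value and keeps the SAE's reconstruction error, so only the clamped feature's contribution changes.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, clamp_value):
    """Return activation x with one SAE feature pinned to clamp_value.

    Assumed (hypothetical) shapes: x (d,), W_enc (n_features, d),
    b_enc (n_features,), W_dec (d, n_features), b_dec (d,).
    """
    # Encode: ReLU feature activations.
    f = np.maximum(0.0, W_enc @ x + b_enc)
    # Residual that the SAE fails to reconstruct; it is kept unchanged.
    error = x - (W_dec @ f + b_dec)
    # Pin the chosen feature to the requested value.
    f_clamped = f.copy()
    f_clamped[feature_idx] = clamp_value
    # Decode with the clamped feature and add the residual back.
    return W_dec @ f_clamped + b_dec + error
```

Under these assumptions the result equals `x + (clamp_value - f[feature_idx]) * W_dec[:, feature_idx]`, that is, the original activation shifted along the feature's decoder direction.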
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Causal effect: activates generation of security bugs.
- Shows feature induces deceptive behavior.
- Feature steers model toward gender-stereotypical completions.
- Feature intervention eliminates untruthful answer.
- Causal effect: feature induces perception of bugs.
- Feature manipulation alters persona.
- Multimodal generalization to visual security bypass.
- Causal effect showing the feature governs computation.