finding
active
finding:clamping-gender-bias-in-professions-feature-34m-24442848-to-high-activation-causes-model-to-emphasize-female-pronouns-and-discuss-nursing-as-female-dominated
Clamping gender bias in professions feature 34M/24442848 to high activation causes model to emphasize female pronouns and discuss nursing as female-dominated.
Feature steers model toward gender-stereotypical completions.
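As a rough illustration of what "clamping" a feature means here, the sketch below assumes the feature comes from a sparse autoencoder (SAE) dictionary over model activations, and that the intervention overwrites one feature's activation, decodes, and adds back the SAE's reconstruction error. All function names, weights, and values are hypothetical toy choices for illustration; they are not the 34M dictionary or the source paper's code.

```python
# Toy sketch of feature clamping with a sparse autoencoder (SAE).
# Assumption: every weight, name, and value below is made up for illustration.

def encode(x, W_enc, b_enc):
    """ReLU(W_enc @ x + b_enc): one activation per dictionary feature."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(f, W_dec, b_dec):
    """b_dec + sum_i f[i] * W_dec[i]; W_dec[i] is feature i's decoder direction."""
    return [b_dec[j] + sum(f[i] * W_dec[i][j] for i in range(len(f)))
            for j in range(len(b_dec))]

def clamp_feature(x, feature_idx, value, W_enc, b_enc, W_dec, b_dec):
    """Return activations with one feature forced to `value`.

    The reconstruction error (x - decode(encode(x))) is added back so that
    only the clamped feature's contribution changes.
    """
    f = encode(x, W_enc, b_enc)
    recon = decode(f, W_dec, b_dec)
    error = [xi - ri for xi, ri in zip(x, recon)]
    f_clamped = list(f)
    f_clamped[feature_idx] = value
    steered = decode(f_clamped, W_dec, b_dec)
    return [si + ei for si, ei in zip(steered, error)]

# Hypothetical 2-dimensional activation with a 3-feature dictionary.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

x = [1.0, 2.0]
steered = clamp_feature(x, feature_idx=2, value=10.0,
                        W_enc=W_enc, b_enc=b_enc, W_dec=W_dec, b_dec=b_dec)
print(steered)
```

Under these assumptions, the steered activation differs from the original only along the clamped feature's decoder direction, scaled by how far the clamp value sits from the feature's natural activation.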
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Feature manipulation alters persona.
- Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training. · finding · 0.780 · Overrides harmlessness training.
- Feature intervention eliminates untruthful answer.
- Causal effect: activates generation of security bugs.
- Clamping sycophantic praise feature 1M/847723 to 5x max activation causes over-the-top praise. · finding · 0.760 · Demonstrates causal role in sycophancy.
- Shows feature induces deceptive behavior.
- Strong causal evidence that the feature represents the bridge.
- Case Study II result showing that DAS identifies fewer causally relevant positions than a probe.